Size-Selection of Cell-Free DNA for Increasing Family Size During Next-Generation Sequencing

ABSTRACT

A method of increasing detection of low-abundant fragments of cell-free DNA (ccfDNA) in a biological sample is disclosed and discussed. Such a method can include isolating an initial fraction of ccfDNA fragments from a biological sample, ligating a unique molecular identifier (UMI) to each of the ccfDNA fragments in the initial fraction, amplifying the plurality of ccfDNA fragments to generate a ccfDNA library, isolating a short fraction of ccfDNA fragments from the ccfDNA library, where the ccfDNA fragments in the short fraction are limited to a size of less than or equal to 160 base pairs (bp), amplifying the ccfDNA fragments in the short fraction, and sequencing the ccfDNA fragments in the short fraction to generate sequenced ccfDNA fragments.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/137,432, filed on Sep. 20, 2018, now issued as U.S. Pat. No. 11,091,800, which claims the benefit of U.S. Provisional Patent Application No. 62/561,149, filed on Sep. 20, 2017, each of which is incorporated herein by reference in its entirety.

BACKGROUND

A portion of the DNA from healthy cells undergoing apoptosis enters the circulation and is known as circulating cell-free DNA as it is not contained within a cellular membrane. Tumor cells similarly deposit DNA into the circulation, which is referred to as circulating tumor DNA (ctDNA). Use of ctDNA is becoming increasingly recognized as a non-invasive means to diagnose and detect tumor recurrence (i.e., the liquid biopsy).

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates steps performed in a method for increasing detection of low-abundant fragments of cell-free DNA (ccfDNA) in a biological sample from a subject in accordance with an example embodiment;

FIG. 2A illustrates data related to insert sizes from three fractions isolated from cell-free DNA using size-based criteria in accordance with an example embodiment;

FIG. 2B illustrates data related to insert sizes from three fractions isolated from cell-free DNA using size-based criteria in accordance with an example embodiment;

FIG. 2C illustrates data related to insert sizes from three fractions isolated from cell-free DNA using size-based criteria in accordance with an example embodiment;

FIG. 3A illustrates data showing variant allele frequency related to increasingly larger family size in accordance with an example embodiment;

FIG. 3B illustrates data showing variant allele frequency related to increasingly larger family size in accordance with an example embodiment;

FIG. 4A illustrates data showing a relationship between size-based selection of cell-free DNA and increasing family size in accordance with an example embodiment;

FIG. 4B illustrates data showing a relationship between size-based selection of cell-free DNA and increasing family size in accordance with an example embodiment;

FIG. 4C illustrates data showing a relationship between size-based selection of cell-free DNA and increasing family size in accordance with an example embodiment;

FIG. 4D illustrates data showing a relationship between size-based selection of cell-free DNA and increasing family size in accordance with an example embodiment;

FIG. 4E illustrates data showing a relationship between size-based selection of cell-free DNA and increasing family size in accordance with an example embodiment;

FIG. 5A illustrates data evaluating false positives for the variant alleles in association with family size in accordance with an example embodiment;

FIG. 5B illustrates data evaluating false positives for the variant alleles in association with family size in accordance with an example embodiment;

FIG. 5C illustrates data evaluating false positives for the variant alleles in association with family size in accordance with an example embodiment;

FIG. 6A illustrates steps in process flows prior to determination of variant allele frequency in accordance with an example embodiment;

FIG. 6B provides data showing a correlation between direct measurement of variant allele frequency in ccfDNA by ddPCR (flow a) and by the multi-step sequencing process (flow b) in accordance with an example embodiment;

FIG. 6C illustrates data in boxplots of wild type alleles and variant alleles by NGS for each cancer patient (C=colorectal adenocarcinoma; M=melanoma; P=pancreatic ductal adenocarcinoma) in accordance with an example embodiment;

FIG. 7A illustrates data relating to the detection of variant alleles by ddPCR in ccfDNA in accordance with an example embodiment;

FIG. 7B illustrates data relating to the detection of variant alleles by ddPCR in ccfDNA in accordance with an example embodiment;

FIG. 7C illustrates data relating to the detection of variant alleles by ddPCR in ccfDNA in accordance with an example embodiment;

FIG. 7D illustrates data relating to the detection of variant alleles by ddPCR in ccfDNA in accordance with an example embodiment;

FIG. 7E illustrates data relating to the detection of variant alleles by ddPCR in ccfDNA in accordance with an example embodiment;

FIG. 8A illustrates data for wild type (WT) and variant allele (VA) counts by ddPCR and NGS in accordance with an example embodiment;

FIG. 8B illustrates data for wild type (WT) and variant allele (VA) counts by ddPCR and NGS in accordance with an example embodiment;

FIG. 9 illustrates data in beeswarm plots for insert sizes <250 bp associated with the wild type (WT) and variant allele (VA) from each patient in accordance with an example embodiment;

FIG. 10 illustrates data in beeswarm plots for insert sizes >250 bp associated with the wild type (WT) and variant allele (VA) from each patient in accordance with an example embodiment;

FIG. 11A illustrates data relating to the effect of size selection on VAF in spiked ccfDNA libraries in accordance with an example embodiment;

FIG. 11B illustrates data relating to the effect of size selection on VAF in spiked ccfDNA libraries in accordance with an example embodiment;

FIG. 11C illustrates data relating to the effect of size selection on VAF in spiked ccfDNA libraries in accordance with an example embodiment;

FIG. 11D illustrates data relating to the effect of size selection on VAF in spiked ccfDNA libraries in accordance with an example embodiment;

FIG. 11E illustrates data relating to the effect of size selection on VAF in spiked ccfDNA libraries in accordance with an example embodiment;

FIG. 11F illustrates data relating to the effect of size selection on VAF in spiked ccfDNA libraries in accordance with an example embodiment;

FIG. 12A illustrates data showing VAFs detected by ddPCR of size-selected and unselected synthetically spiked ccfDNA libraries in accordance with an example embodiment;

FIG. 12B illustrates data showing VAFs detected by ddPCR of size-selected and unselected synthetically spiked ccfDNA libraries in accordance with an example embodiment;

FIG. 12C illustrates data showing VAFs detected by ddPCR of size-selected and unselected synthetically spiked ccfDNA libraries in accordance with an example embodiment;

FIG. 12D illustrates data showing VAFs detected by ddPCR of size-selected and unselected synthetically spiked ccfDNA libraries in accordance with an example embodiment;

FIG. 13A illustrates data relating to the enrichment of variant alleles in short ccfDNA fractions in accordance with an example embodiment;

FIG. 13B illustrates data relating to the enrichment of variant alleles in short ccfDNA fractions in accordance with an example embodiment;

FIG. 13C illustrates data relating to the enrichment of variant alleles in short ccfDNA fractions in accordance with an example embodiment;

FIG. 13D illustrates data relating to the enrichment of variant alleles in short ccfDNA fractions in accordance with an example embodiment;

FIG. 13E illustrates data relating to the enrichment of variant alleles in short ccfDNA fractions in accordance with an example embodiment;

FIG. 13F illustrates data relating to the enrichment of variant alleles in short ccfDNA fractions in accordance with an example embodiment;

FIG. 14A illustrates data for VAF by ddPCR in size-selected ccfDNA libraries in accordance with an example embodiment;

FIG. 14B illustrates data for VAF by ddPCR in size-selected ccfDNA libraries in accordance with an example embodiment;

FIG. 14C illustrates data for VAF by ddPCR in size-selected ccfDNA libraries in accordance with an example embodiment;

FIG. 14D illustrates data for VAF by ddPCR in size-selected ccfDNA libraries in accordance with an example embodiment;

FIG. 14E illustrates data for VAF by ddPCR in size-selected ccfDNA libraries in accordance with an example embodiment;

FIG. 15A illustrates data showing the percent difference in wild type (WT) and variant counts for each ccfDNA fraction relative to unselected ccfDNA counts in accordance with an example embodiment;

FIG. 15B illustrates data showing the percent difference in wild type (WT) and variant counts for each ccfDNA fraction relative to unselected ccfDNA counts in accordance with an example embodiment;

FIG. 16 illustrates data showing median insert size for the wild type (WT) and variant allele (VA) for each ccfDNA fraction in accordance with an example embodiment;

FIG. 17A illustrates data relating to the generation of large family sizes in short ccfDNA in accordance with an example embodiment;

FIG. 17B illustrates data relating to the generation of large family sizes in short ccfDNA in accordance with an example embodiment;

FIG. 17C illustrates data relating to the generation of large family sizes in short ccfDNA in accordance with an example embodiment;

FIG. 17D illustrates data relating to the generation of large family sizes in short ccfDNA in accordance with an example embodiment;

FIG. 17E illustrates data relating to the generation of large family sizes in short ccfDNA in accordance with an example embodiment;

FIG. 18A illustrates data related to the generation of family sizes in buffy coat DNA, unselected ccfDNA, short, medium, and long ccfDNA fractions in accordance with an example embodiment;

FIG. 18B illustrates data related to the generation of family sizes in buffy coat DNA, unselected ccfDNA, short, medium, and long ccfDNA fractions in accordance with an example embodiment;

FIG. 18C illustrates data related to the generation of family sizes in buffy coat DNA, unselected ccfDNA, short, medium, and long ccfDNA fractions in accordance with an example embodiment;

FIG. 18D illustrates data related to the generation of family sizes in buffy coat DNA, unselected ccfDNA, short, medium, and long ccfDNA fractions in accordance with an example embodiment;

FIG. 18E illustrates data related to the generation of family sizes in buffy coat DNA, unselected ccfDNA, short, medium, and long ccfDNA fractions in accordance with an example embodiment;

FIG. 19A illustrates data related to the reduction of false positives at larger family sizes in accordance with an example embodiment;

FIG. 19B illustrates data related to the reduction of false positives at larger family sizes in accordance with an example embodiment;

FIG. 19C illustrates data related to the reduction of false positives at larger family sizes in accordance with an example embodiment;

FIG. 19D illustrates data related to the reduction of false positives at larger family sizes in accordance with an example embodiment;

FIG. 19E illustrates data related to the reduction of false positives at larger family sizes in accordance with an example embodiment;

FIG. 19F illustrates data related to the reduction of false positives at larger family sizes in accordance with an example embodiment;

FIG. 20A illustrates data showing comparisons of coverage, on-target fraction, and family size between unselected ccfDNA from healthy controls and patients in accordance with an example embodiment;

FIG. 20B illustrates data showing comparisons of coverage, on-target fraction, and family size between unselected ccfDNA from healthy controls and patients in accordance with an example embodiment;

FIG. 20C illustrates data showing comparisons of coverage, on-target fraction, and family size between unselected ccfDNA from healthy controls and patients in accordance with an example embodiment;

FIG. 20D illustrates data showing comparisons of coverage, on-target fraction, and family size between unselected ccfDNA from healthy controls and patients in accordance with an example embodiment;

FIG. 21A illustrates data showing the effects of family size on VAF in accordance with an example embodiment;

FIG. 21B illustrates data showing the effects of family size on VAF in accordance with an example embodiment;

FIG. 21C illustrates data showing the effects of family size on VAF in accordance with an example embodiment;

FIG. 21D illustrates data showing the effects of family size on VAF in accordance with an example embodiment;

FIG. 21E illustrates data showing the effects of family size on VAF in accordance with an example embodiment;

FIG. 22A illustrates data showing the effects of family size on VAF in the medium ccfDNA fraction in accordance with an example embodiment;

FIG. 22B illustrates data showing the effects of family size on VAF in the medium ccfDNA fraction in accordance with an example embodiment;

FIG. 22C illustrates data showing the effects of family size on VAF in the medium ccfDNA fraction in accordance with an example embodiment;

FIG. 23A illustrates data showing the effects of family size on VAF in the long ccfDNA fraction in accordance with an example embodiment;

FIG. 23B illustrates data showing the effects of family size on VAF in the long ccfDNA fraction in accordance with an example embodiment;

FIG. 23C illustrates data showing the effects of family size on VAF in the long ccfDNA fraction in accordance with an example embodiment;

FIG. 24 illustrates an example consensus alignment workflow in accordance with an example embodiment;

FIG. 25A illustrates data showing false positive droplet events in control samples in accordance with an example embodiment;

FIG. 25B illustrates data showing false positive droplet events in control samples in accordance with an example embodiment;

FIG. 25C illustrates data showing false positive droplet events in control samples in accordance with an example embodiment;

DESCRIPTION OF EMBODIMENTS

Although the following detailed description contains many specifics for the purpose of illustration, one of ordinary skill in the art will appreciate that many variations and alterations to the following details can be made and are considered included herein. Accordingly, the following embodiments are set forth without any loss of generality to, and without imposing limitations upon, any claims set forth. It is also to be understood that the terminology used herein is for describing particular embodiments only, and is not intended to be limiting. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Also, the same reference numerals appearing in different drawings represent the same element. Numbers provided in flow charts and processes are provided for clarity in illustrating steps and operations and do not necessarily indicate a particular order or sequence.

Furthermore, the described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are described to provide a thorough understanding of various embodiments. One of ordinary skill in the relevant art will recognize, however, that such detailed embodiments do not limit the overall concepts articulated herein, but are merely representative thereof, and will also recognize that the technology can be practiced without one or more of the specific details, or with other methods, components, layouts, etc. In other instances, well-known structures, materials, techniques, or the like may not be shown or described in detail to avoid obscuring aspects of the disclosure.

As used herein, the terms “comprises,” “comprising,” “containing,” “having,” and the like, all have the meaning ascribed to them according to U.S. Patent law, and can mean “includes,” “including,” and the like, which are open-ended terms. The terms “consisting of” or “consists of” are closed terms, and include only the components, structures, steps, or the like specifically listed in conjunction with such terms, as well as that which is in accordance with U.S. Patent law. “Consisting essentially of” or “consists essentially of” have the meaning ascribed to them according to U.S. Patent law. In particular, such terms are generally closed terms, with the exception of allowing inclusion of additional items, materials, components, steps, or elements, that do not materially affect the basic and novel characteristics or function of the item(s) used in connection therewith. For example, trace elements present in a composition that do not affect the composition's nature or characteristics would be permissible if present under the “consisting essentially of” language, even though not expressly recited in a list of items following such terminology. When using an open-ended term in this written description, such as “comprising” or “including,” it is understood that direct support should be afforded also to “consisting essentially of” language as well as “consisting of” language as if stated explicitly and vice versa.

As used herein, the term “substantially” refers to the complete or nearly complete extent or degree of an action, characteristic, property, state, structure, item, or result. For example, an object that is “substantially” enclosed would mean that the object is either completely enclosed or nearly completely enclosed. The exact allowable degree of deviation from absolute completeness may in some cases depend on the specific context. However, generally speaking the nearness of completion will be so as to have the same overall result as if absolute and total completion were obtained. The use of “substantially” is equally applicable when used in a negative connotation to refer to the complete or near complete lack of an action, characteristic, property, state, structure, item, or result. For example, a composition that is “substantially free of” particles would either completely lack particles, or so nearly completely lack particles that the effect would be the same as if it completely lacked particles. In other words, a composition that is “substantially free of” an ingredient or element may still actually contain such item as long as there is no measurable effect thereof.

As used herein, the term “about” is used to provide flexibility to a numerical range endpoint by providing that a given value may be “a little above” or “a little below” the endpoint. However, it is to be understood that even when the term “about” is used in the present specification in connection with a specific numerical value, that support for the exact numerical value recited apart from the “about” terminology is also provided.

As used herein, a plurality of items, structural elements, compositional elements, and/or materials may be presented in a common list for convenience. However, these lists should be construed as though each member of the list is individually identified as a separate and unique member. Thus, no individual member of such list should be construed as a de facto equivalent of any other member of the same list solely based on their presentation in a common group without indications to the contrary.

Concentrations, amounts, and other numerical data may be expressed or presented herein in a range format. It is to be understood that such a range format is used merely for convenience and brevity and thus should be interpreted flexibly to include not only the numerical values explicitly recited as the limits of the range, but also to include all the individual numerical values or sub-ranges encompassed within that range as if each numerical value and sub-range is explicitly recited. As an illustration, a numerical range of “about 1 to about 5” should be interpreted to include not only the explicitly recited values of about 1 to about 5, but also include individual values and sub-ranges within the indicated range. Thus, included in this numerical range are individual values such as 2, 3, and 4 and sub-ranges such as from 1-3, from 2-4, and from 3-5, etc., as well as 1, 1.5, 2, 2.3, 3, 3.8, 4, 4.6, 5, and 5.1 individually. This same principle applies to ranges reciting only one numerical value as a minimum or a maximum. Furthermore, such an interpretation should apply regardless of the breadth of the range or the characteristics being described.

Reference throughout this specification to “an example” means that a particular feature, structure, or characteristic described in connection with the example is included in at least one embodiment. Thus, appearances of phrases including “an example” or “an embodiment” in various places throughout this specification are not necessarily all referring to the same example or embodiment.

The terms “first,” “second,” “third,” “fourth,” and the like in the description and in the claims, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Similarly, if a method is described herein as comprising a series of steps, the order of such steps as presented herein is not necessarily the only order in which such steps may be performed, and certain of the stated steps may possibly be omitted and/or certain other steps not described herein may possibly be added to the method.

As used herein, the term “biological sample” refers to a complex mixture of biological origin, such as that obtained from a biological subject, which in many cases can be a human subject.

An initial overview of embodiments is provided below, and specific embodiments are then described in further detail. This initial summary is intended to aid readers in understanding the disclosure more quickly, and is not intended to identify key or essential technological features, nor is it intended to limit the scope of the claimed subject matter.

Circulating cell-free DNA (ccfDNA) refers to fragments of DNA that are no longer within a cell and are present in the circulatory system. The term ccfDNA can also be used to describe these DNA fragments following extraction and subsequent processing. ccfDNA can be released from a cell as a result of various processes, including both normal and abnormal apoptotic events, cellular excretions, necrosis, and the like. Specific forms of ccfDNA may be present in the circulatory system as a result of various medical conditions, disease states, pregnancy, and the like. The accessibility of these genomic fragments in the circulatory system through a simple blood sample provides a low-risk opportunity to screen for various phenotypes, conditions, and the like. As a result, ccfDNA testing is an emerging diagnostic approach that allows for noninvasive, rapid, and real-time testing in research and clinical settings.

For example, during pregnancy, placental cells release fetal DNA into the mother's circulatory system, which is referred to herein as circulating fetal-derived cell-free DNA (cfDNA). cfDNA thus provides the opportunity to perform fetal genetic screening from samples of the mother's blood, thus avoiding more invasive and potentially risky amniocentesis and chorionic villus sampling (CVS) procedures. cfDNA can be used to screen for any genetic phenotype or condition detectable in the fetal genome, nonlimiting examples of which can include fetal sex, fetal rhesus (Rh) blood type, chromosomal disorders such as trisomy 22, trisomy 21 (Down syndrome), trisomy 18, trisomy 16, trisomy 13, triploidy, sex chromosome aneuploidy, and the like, chromosomal deletion disorders (microdeletion syndrome) such as Prader-Willi syndrome, and disorders associated with bone or anatomical abnormalities, to name a few.

As another example, solid tissues, including cancers, also contribute to the plasma ccfDNA pool. Such circulating tumor-derived DNA fragments, referred to herein as circulating tumor DNA (ctDNA), are a type of cell-free DNA that can originate directly from a cancer, tumor, or from circulating tumor cells that have been shed from primary tumors and have entered the bloodstream or lymphatic system. ctDNA bears the molecular signatures of the neoplastic cell genome. Relative to microdissection of tumor tissue, which interrogates a minute and focal fraction of intratumor genetic diversity, ctDNA can be used to sample clonal varieties of both primary and metastatic sites through perfusion sampling. However, ctDNA is typically present at very low allele frequencies (e.g., median of ˜0.5% in some cases) due to dilution by the abundant normal ccfDNA. As total ctDNA content is correlated with advancing disease stage, application of ccfDNA diagnostics for early disease detection will likely need reliable identification of very low variant allele frequencies (VAFs), which can in many cases be less than 1%. In addition, intratumoral genetic heterogeneity is common and is a key challenge in cancer medicine. Identification of minor subclonal populations is important for detection of emerging chemoresistance, minimal residual disease, and disease progression.

Broad clinical applications based on various forms of ccfDNA have been limited by challenges associated with detecting such ccfDNA forms amongst the more abundant normal ccfDNA. This can be particularly true for ccfDNA forms associated with non-metastatic solid tumors, for example, where ctDNA variants may be present at a very low frequency (e.g., <1%). It is illuminating to note that the sequencing error rate for standard next-generation sequencing (NGS) protocols is ˜1%. Therefore, to detect variants with a frequency <1%, an ultra-high read depth, a reduced error rate, an improved sensitivity, or some combination thereof is likely needed. Thus, accurate detection of variant alleles, particularly when using untargeted searches, remains an obstacle to widespread application of cell-free DNA in screening for clinical use, and particularly for oncology.

NGS enables a broad search for both known and unknown tumor-associated variants including single nucleotide variants, copy number variations, and chromosomal rearrangements. However, even the highest fidelity sequencing platforms introduce errors at ≥0.1%. Additional nucleotide changes may be introduced in the PCR amplification steps of sequencing library preparations. Such accumulation of potential false positive errors by sequencing and PCR limits reliable identification of true variants that occur with <1% frequency. Several approaches have been taken to improve very low variant calling by NGS. Ultra-deep sequencing (e.g., >30,000× read depth) improves detection of low frequency variants in ccfDNA, but may not be cost-effective for routine diagnostic testing.
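As a non-limiting illustration of why per-base errors constrain detection of variants below 1%, the following Python sketch compares the expected number of error-bearing reads with the expected number of true variant reads at a single position. The read depth of 1,000×, the Poisson approximation, and the worst-case assumption that every error produces the same alternate base are assumptions made only for this example; the 0.1% error rate and 0.5% VAF are taken from the ranges discussed above. The function names are hypothetical.

```python
import math

def expected_reads(depth: int, rate: float) -> float:
    """Expected number of reads at one position that show a given event."""
    return depth * rate

def poisson_tail(lam: float, k: int) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed from the cumulative sum."""
    term = math.exp(-lam)      # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term            # add P(X = i)
        term *= lam / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)

# Illustrative values only (assumptions, not measured data).
depth = 1000         # reads covering the position
error_rate = 0.001   # 0.1% per-base error rate
vaf = 0.005          # 0.5% variant allele frequency

noise_reads = expected_reads(depth, error_rate)  # error-bearing reads
signal_reads = expected_reads(depth, vaf)        # true variant reads

# Chance that errors alone yield at least as many supporting reads
# as a true 0.5% variant would be expected to produce.
p_false_call = poisson_tail(noise_reads, k=5)
```

Under these assumptions the true variant is expected to contribute only a handful of reads above the error background, which is why raw read counts alone become unreliable as the VAF approaches the error rate.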

The present disclosure demonstrates techniques that overcome these obstacles and provide high accuracy screening of ccfDNA for fragments of interest that are present at very low frequencies in a biological sample. In one such technique, sample complexity can be reduced to increase the signal-to-noise ratio of a fragment of interest prior to sequencing. Reducing sample complexity facilitates an increase in sequencing read depth for ccfDNA fragments, thus greatly increasing the signal-to-noise ratio of variant alleles in the ccfDNA sample. While any technique for reducing sample complexity is considered to be within the present scope, in one example embodiment this can be accomplished through a size-based selection of ccfDNA fragments prior to sequencing. Such enrichment can be accomplished by selecting for shorter ccfDNA fragments, an approach that is feasible as ccfDNA does not require shearing prior to library preparation, as is common with buffy-coat DNA techniques or tumor DNA that is genomic in length (>1 kb). As has been described, ccfDNA is already in a fragmented state due to the apoptotic process and subsequent degradation by nucleases. The most common fragment length for ccfDNA is about 170 bp, which corresponds to the length of a mononucleosome. A distribution of fragment sizes is present around this principal peak. Additional peaks generally occur at ˜340 bp (dinucleosome) and ˜510 bp (trinucleosome). The principal fragment length of many ccfDNA fragments of interest is shorter. For example, many ctDNA fragments most commonly occur at about 120-150 bp. Furthermore, selection of shorter fragment sizes from an original cell-free DNA sample can increase VAF, thus improving sensitivity.
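The effect of size-based selection on VAF can be sketched with a toy calculation. In the following Python example, each fragment is represented by a hypothetical pair of its length in bp and a flag indicating whether it carries the variant allele; the fragment lengths are invented for illustration (wild-type fragments clustered near the ~170 bp mononucleosome peak, variant fragments mostly at about 120-150 bp) and are not measured data. The names `vaf` and `size_select` are illustrative only.

```python
from typing import List, Tuple

# (fragment length in bp, True if the fragment carries the variant allele)
Fragment = Tuple[int, bool]

def vaf(fragments: List[Fragment]) -> float:
    """Variant allele frequency: variant fragments / all fragments."""
    if not fragments:
        return 0.0
    return sum(1 for _, is_variant in fragments if is_variant) / len(fragments)

def size_select(fragments: List[Fragment], max_bp: int = 160) -> List[Fragment]:
    """Keep only fragments at or below the size cutoff (short fraction)."""
    return [f for f in fragments if f[0] <= max_bp]

# Toy library: 8 wild-type and 4 variant fragments (hypothetical lengths).
library: List[Fragment] = [
    (170, False), (168, False), (172, False), (166, False),
    (175, False), (340, False), (155, False), (171, False),
    (132, True), (145, True), (150, True), (169, True),
]

short_fraction = size_select(library, max_bp=160)

unselected_vaf = vaf(library)        # 4 of 12 fragments
short_vaf = vaf(short_fraction)      # 3 of 4 fragments
```

In this toy library the variant fraction rises after selection because most wild-type fragments exceed the 160 bp cutoff while most variant fragments do not. One variant fragment (169 bp) is lost, which illustrates that size selection trades some absolute variant counts for a higher variant fraction.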

As is shown in FIG. 1, one example method of increasing detection of low-abundance target fragments of ccfDNA in a biological sample from a subject can include 102 isolating an initial fraction of ccfDNA fragments from a biological sample, 104 ligating a unique molecular identifier (UMI) to each of the ccfDNA fragments in the initial fraction, 106 amplifying the plurality of ccfDNA fragments to generate a ccfDNA library, 108 isolating a short fraction of ccfDNA fragments from the ccfDNA library, where the ccfDNA fragments in the short fraction are limited to a size of less than or equal to 160 bp, 110 amplifying the ccfDNA fragments in the short fraction, and 112 sequencing the ccfDNA fragments in the short fraction. The process of isolating shorter fragments from the ccfDNA library prior to sequencing increases the VAF to improve sensitivity, thus allowing low concentration target fragments, such as ctDNA, cfDNA, and the like, to be enriched.

A biological sample can include any sample taken from a subject that can include ccfDNA. One common biological sample is blood, including components of blood such as serum or plasma. In some examples, the biological sample is taken from the subject being screened, while in other examples the subject being screened can be different from the subject from which the sample was taken, such as would be the case for a pregnant mother providing a biological sample for screening a fetus. Thus, the extraction of the biological sample can vary depending on the nature of the sample itself and the particular screening assay being performed. In one example, a blood sample can be centrifuged to separate into the well-known blood cell, buffy-coat, and plasma layers. ccfDNA is generally present in the plasma layer following centrifugation, and can be isolated therefrom (i.e., the initial fraction of ccfDNA). Once isolated, a UMI can be ligated to each ccfDNA fragment in the initial fraction. It should be noted that the ligation reaction may not ligate a UMI to all ccfDNA fragments present in the sample. It is thus intended that the phrase “ligated to each ccfDNA fragment in the initial fraction” define the initial fraction as including only those ccfDNA fragments that were successfully ligated.

The UMI can include any type of molecular identifier, including molecular barcodes for example, that is capable of uniquely identifying each ccfDNA fragment and being amplified with the associated ccfDNA fragment. UMIs can include external adapters, internal adapters, or a combination thereof. Such adapters can be custom adapters, standard adapters, or standard adapters with custom modifications. The UMIs thus enable tracking of each ccfDNA fragment duplicate during the PCR amplification process.
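As a non-limiting sketch of how UMIs allow duplicates of the same original fragment to be tracked, the following Python example groups sequenced reads into families by UMI and alignment start position and then takes a majority-vote consensus base for each family that meets a minimum family size. The read representation (a tuple of UMI, start position, and observed base at a site of interest), the grouping key, the majority-vote rule, and the threshold of three reads are all assumptions made for illustration and are not asserted to be the consensus workflow of FIG. 24.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# (UMI sequence, alignment start, base observed at the site of interest)
Read = Tuple[str, int, str]
FamilyKey = Tuple[str, int]

def group_families(reads: List[Read]) -> Dict[FamilyKey, List[str]]:
    """Group reads sharing a UMI and start position into one family."""
    families: Dict[FamilyKey, List[str]] = defaultdict(list)
    for umi, start, base in reads:
        families[(umi, start)].append(base)
    return dict(families)

def consensus_calls(
    families: Dict[FamilyKey, List[str]], min_family_size: int = 3
) -> Dict[FamilyKey, str]:
    """Majority-vote base per family; small families are skipped."""
    calls: Dict[FamilyKey, str] = {}
    for key, bases in families.items():
        if len(bases) < min_family_size:
            continue  # too few duplicates to outvote an error
        calls[key] = Counter(bases).most_common(1)[0][0]
    return calls

# Hypothetical reads: one family contains a single discordant base.
reads: List[Read] = [
    ("AACG", 100, "T"), ("AACG", 100, "T"), ("AACG", 100, "C"),
    ("GGTA", 100, "C"),
    ("TTAC", 250, "C"), ("TTAC", 250, "C"), ("TTAC", 250, "C"),
]

families = group_families(reads)
family_sizes = {key: len(bases) for key, bases in families.items()}
calls = consensus_calls(families, min_family_size=3)
```

In this sketch the discordant base in the first family is outvoted by the other two duplicates, and the single-read family is excluded because it offers no duplicates against which an error could be checked.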

Once the ccfDNA library has been constructed following the amplification of the ccfDNA fragments from the initial fraction, the short fraction of ccfDNA fragments can be isolated for further amplification and sequencing. The isolation of the short fraction can be accomplished by any technique capable of extracting ccfDNA fragments based on size (i.e., length), which is not to be seen as limiting. In one example, the extraction technique can utilize an electrophoretic gel-based process, including polyacrylamide and agarose gels. Such gel-based extractions can include manual techniques (e.g., gel cutouts), fully automated techniques, and semi-automated techniques. One nonlimiting example of a non-gel extraction technique includes liquid chromatography extractions.

As one example of a gel-based extraction technique, ccfDNA fragments from the ccfDNA library are electrophoretically migrated through an electrophoretic gel to separate the ccfDNA fragments based on size. A target portion of the electrophoretic gel is then selected that corresponds to the ccfDNA fragment size of the short fraction, and the target portion is extracted from the gel to isolate the short fraction of ccfDNA fragments. As has been described, this process can be a manual extraction, where the target portion is manually cut from the gel. As has also been described, the electrophoresis and subsequent extraction can be accomplished in an automated or semi-automated fashion, which in some cases can result in increased migration and/or extraction accuracy.

Once isolated from the longer ccfDNA fragments of the ccfDNA library, the ccfDNA fragments of the short fraction can be amplified by PCR or by another technique. Thus, by first reducing the number of unique ccfDNA fragments in a sample (e.g., to shorter ccfDNA fragments), followed by PCR amplification of the size-reduced fraction of ccfDNA, the PCR enzymatic chemistry becomes focused on fewer ccfDNA fragments, which generates more amplicons of the same DNA molecule to effectively increase family size.

As has been described, in many examples the size of ccfDNA fragments in the selected short fraction is less than the peak of the mononucleosome, which is about 170 bp. The specific fragment size cutoff for the short fraction can vary, depending on the size of the fragment of interest, the screening design, and the like. In one example, the size cutoff for the short fraction can be less than or equal to 160 bp. In another example, the size cutoff for the short fraction can be less than or equal to 155 bp. In yet another example, the size cutoff for the short fraction can be less than or equal to 150 bp. In a further example, the size cutoff for the short fraction can be less than or equal to 145 bp. In another example, the size cutoff for the short fraction can be less than or equal to 130 bp.

In another example, isolating the short fraction of ccfDNA fragments from the ccfDNA library can include separating ccfDNA fragments from the ccfDNA library by liquid chromatography into fractions according to ccfDNA fragment size. Either during the separation process, or using isolated fractions following separation, a target fraction is selected that corresponds to the short fraction based on ccfDNA fragment size, which is extracted or otherwise utilized as the short fraction for subsequent amplification and sequencing.

Regardless of the specific techniques utilized, once amplified the ccfDNA fragments in the extracted short fraction are sequenced using any known and useful sequencing technique. In one example, the short fraction can be sequenced by any form of NGS procedure. The size-based extraction of the short fraction from the fragment library, combined with the subsequent amplification of the ccfDNA fragments from that fraction, allows NGS with sequencing error rates well below the ~1% standard error rate for such protocols, which in many cases can be ~0.05%, ~0.01%, or lower.

Following sequencing, ccfDNA fragment sequences can be grouped according to each UMI. DNA amplicons having the same UMI are considered to be a family, as each was derived from the same initial ccfDNA fragment. As such, DNA amplicons can be grouped according to the same UMI. In some examples, DNA amplicons having largely similar UMIs (e.g., >0.875) can also be grouped together, either with the group of DNA amplicons having the same UMI or as a separate group. Following grouping, a single consensus sequence is built for each UMI group. The technique for building the consensus sequence is not limiting. As one example, however, all of the sequences in a given UMI group are aligned and each base position in the consensus sequence is represented with the most common base in the family for that position.
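As a minimal sketch of the grouping and consensus steps just described, the following Python groups reads by exact UMI and takes the most common base at each position. The function names, the example UMIs, and the example reads are hypothetical and chosen only for illustration; the sketch assumes the reads in a family are already aligned to equal length, and it does not implement the grouping of largely similar UMIs.

```python
from collections import Counter, defaultdict

def group_by_umi(reads):
    """Group (umi, sequence) pairs into families keyed by exact UMI."""
    families = defaultdict(list)
    for umi, seq in reads:
        families[umi].append(seq)
    return dict(families)

def consensus(seqs):
    """Return the most common base at each position of a family.

    Assumes the sequences are pre-aligned and of equal length. Ties are
    resolved in favor of the base seen first, which is an arbitrary choice.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("reads in a family must be aligned to the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))

# Hypothetical reads: three amplicons share one UMI, one of them with an error.
reads = [
    ("ACGTACGT", "TTGACC"),
    ("ACGTACGT", "TTGACC"),
    ("ACGTACGT", "TTGTCC"),  # single PCR or sequencing error at the fourth base
    ("GGCATTAC", "TTGACA"),
]

families = group_by_umi(reads)
consensus_by_umi = {umi: consensus(seqs) for umi, seqs in families.items()}
family_size = {umi: len(seqs) for umi, seqs in families.items()}
```

In this toy input the family of size three outvotes its single erroneous read, which is the reason larger families give more reliable consensus calls.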

Subsequent use of consensus sequences can vary depending on the intention of the screen. In many cases, consensus sequences can be compared against a sequence library of target sequences associated with genetic conditions, and a consensus sequence can be matched to a target sequence in the sequence library. If, for example, the genetic condition is a medical condition, the specific medical condition associated with the matched target sequence can be identified. Once identified, the medical condition can be diagnosed in the subject, and an appropriate medical treatment can be prescribed or performed on the subject in order to treat the medical condition. In one example, such a medical condition can include a solid tumor, and thus the consensus sequence is from ctDNA. Treatment for such can vary depending on the specific type of tumor.

In another example, the genetic condition can be a genetic phenotype, which can include a normal genetic phenotype or an abnormal genetic phenotype. Examples include those listed above in the description of cfDNA. In such cases, it is noted that the subject providing the biological sample is a pregnant mother and the genetic phenotype is a fetal genetic phenotype.

Example Study 1

Size-Selection of Cell-Free DNA Increases Variant Allele Frequency

ccfDNA from N=13 patients with solid tumors and known BRAF or KRAS tumor-derived variants present in ccfDNA underwent adapter ligation and UMI assignment. Samples were subsequently PCR amplified to generate libraries. Two specific size-range fractions were targeted for extraction from the ccfDNA library (1 μg) using an automated process (Ranger Technology, Coastal Genomics, Burnaby, BC, Canada) during a single run. An intermediate third size range was targeted for extraction on a second independent run from the ccfDNA library (1 μg) using the same automated process. The extracted fractions were amplified and then sequenced using a 128-gene panel (128 kb) on a HiSeq 2500 with 125-cycle paired-end reads. Sequencing data were aligned, consensus sequences identified, and family size information for each consensus sequence denoted.

The insert sizes from each fraction were overlapping (FIG. 2A); however, the median insert size from each size-selection yielded discrete fractions (FIG. 2B). Variant allele frequency increased the most in the ‘short’ fraction (FIG. 2C). FIG. 2A specifically shows insert sizes from each of the three fractions (‘short’, ‘medium’, ‘long’) isolated from cell-free DNA using size-based criteria. The distribution of insert sizes prior to isolation of specific size-based fractions is labeled as ‘unselected.’ Note that there is overlap in insert sizes between fractions. However, the median insert size from each fraction is significantly different between fractions, as is shown in FIG. 2B. The variant allele frequency increased the most in the ‘short’ fraction, while the variant allele frequency decreased the most in the ‘long’ fraction, as is shown in FIG. 2C. The largest gain in variant allele frequency occurred in samples that began with the lowest variant allele frequency, as is shown in the inset in FIG. 2C. ***=P<0.001, NS=not significant.

Size Selection Does Not Adversely Affect Variant Allele Frequency

Change in variant allele frequency related to increasingly larger family size was then evaluated. In the ‘unselected’ cell-free DNA, variant allele frequency was relatively constant up to a family size of ≥10; however, subsequent incremental increases in family size caused loss of variant alleles (FIG. 3A). In contrast, variant allele frequency was relatively constant in the ‘short’ fraction isolated from cell-free DNA over a similar family size range (FIG. 3B). FIGS. 3A and 3B show variant allele frequency (VAF) as a function of family size for the ‘unselected’ cell-free DNA (FIG. 3A) and the ‘short’ fraction of cell-free DNA (FIG. 3B). In the ‘unselected’ cell-free DNA, the VAF reduced to zero in some of the samples when family size became >10 (FIG. 3A inset). In the ‘short’ fraction of cell-free DNA (FIG. 3B), the VAF remained relatively constant up to a family size of 15 (FIG. 3B inset).
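The evaluation described above, VAF as a function of a minimum family size threshold, can be sketched as follows. This is an illustration only: the function name and the counts are hypothetical and are not data from the study. Each consensus read at a locus is represented as a (family size, is-variant) pair.

```python
def vaf_by_min_family_size(consensus_reads, thresholds):
    """Compute (read depth, VAF) at a locus for each minimum family size.

    consensus_reads: list of (family_size, is_variant) pairs, one per
    consensus read covering the locus. VAF is None when no reads remain.
    """
    result = {}
    for threshold in thresholds:
        kept = [is_variant for size, is_variant in consensus_reads if size >= threshold]
        depth = len(kept)
        vaf = sum(kept) / depth if depth else None
        result[threshold] = (depth, vaf)
    return result

# Hypothetical locus: most wild type families, two variant families.
locus = (
    [(2, False)] * 50
    + [(6, False)] * 30
    + [(12, False)] * 18
    + [(16, False)] * 5
    + [(4, True), (11, True)]
)

summary = vaf_by_min_family_size(locus, thresholds=[1, 5, 10, 15])
```

With these made-up counts the read depth shrinks as the threshold rises, and the VAF falls to zero once the threshold exceeds the largest variant family, which mirrors the loss of variant alleles described for the ‘unselected’ cell-free DNA.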

Size-Based Selection of Cell-Free DNA Increases Family Size

In addition to the ‘unselected’ cell-free DNA and ‘short’ cell-free DNA, libraries of buffy-coat DNA from the patients were similarly made using UMIs. In contrast to the cell-free DNA process, buffy-coat DNA was sheared and underwent only a single round of PCR amplification prior to capture enrichment for sequencing. The shearing of buffy-coat DNA is necessary because buffy-coat DNA is of genomic length, which is not amenable to next-generation sequencing. The shearing of buffy-coat DNA generates an abundance of unique DNA molecules as the shearing process is random. In contrast, cell-free DNA is not sheared during the library formation process as the length of cell-free DNA is compatible with next-generation sequencers.

For a similar number of total reads (FIG. 4A), the average number of consensus aligned reads was greatest in the buffy coat DNA, followed by the ‘unselected’ cell-free DNA, and then the ‘short’ cell-free DNA (FIG. 4B). The consequent effect was larger family sizes in the ‘short’ cell-free DNA as there were fewer unique DNA molecules, which allotted more reads to PCR replicates (FIG. 4C). Thus, consensus read depth changed less in the ‘short’ cell-free DNA at the locations for the variants of interest (FIG. 4D) and read depth was greatest in the ‘short’ cell-free DNA at larger family sizes (FIG. 4E). In very-low-frequency variants, there was a loss of variant allele frequency at larger family sizes in the ‘unselected’ cell-free DNA (FIG. 3A inset). This is explained by the reduced read depth. The relatively persistent variant allele frequencies seen in the ‘short’ cell-free DNA at larger family sizes (FIG. 3B inset) are attributable to the greater read depth at larger family sizes and the increased variant allele frequency that size-based selection of cell-free DNA affords. Total reads were similar for buffy coat DNA (‘Buffy’), the ‘unselected’ cell-free DNA, and the ‘short’ cell-free DNA (FIG. 4A). Aligned consensus read depth (i.e., family size ≥1) was greatest for the buffy coat DNA (FIG. 4B). Family size was statistically largest in the ‘short’ cell-free DNA due to reduction of unique DNA molecules compared to both the buffy coat DNA and the ‘unselected’ cell-free DNA (FIG. 4C). At locations associated with variants of interest, the read depth was the most consistent regardless of family size for the ‘short’ cell-free DNA (FIG. 4D, circles). At a family size of ≥15, consensus read depth was greatest for the ‘short’ cell-free DNA (FIG. 4E). *** P<0.001, ** P=0.001, NS=not significant.

Larger Family Sizes Reduce False Positives

False positives for the variant alleles were then evaluated in association with family size. In the buffy coat DNA, false positives reduced with larger family sizes (FIG. 5A). In cell-free DNA from healthy controls, false positives similarly reduced with larger family sizes (FIG. 5B). Of note, different variants were associated with different levels of false positives (FIG. 5C), which suggests that false positives may be larger or smaller at different locations as compared to that presented herein. As such, the largest possible family sizes are necessary to minimize false positives. Counts of corresponding variant alleles (false positives) in the buffy coat from the N=13 cancer patients are shown in FIG. 5A. In FIG. 5B, cell-free DNA from healthy controls is probed for the BRAF V600E variant. Note that all participants had at least one variant allele present at family size ≥1, and one participant had a variant allele up to a family size ≥13. In FIG. 5C, the KRAS G12D variant was probed in healthy controls and few variant alleles were identified.

Collectively, these findings demonstrate the true impact of size-based reduction of sample complexity to generate larger family sizes and improvement in both sensitivity through ctDNA enrichment and specificity through reduction in false positives. Such size-based selection of cell-free DNA generates larger family sizes during NGS applications. The data shown support the impact of the methodology through reduction of false positives while maintaining and/or improving sensitivity. This technology is particularly useful in detecting very-low-frequency (<1%) variants in cell-free DNA where false positives directly affect confidence of differentiating true variants from PCR and sequencing errors. Moreover, this technology has implications that extend beyond oncology. In particular, this methodology may have utility in non-invasive prenatal screening to improve detection and genotyping of fetal DNA, which has been shown to have a similar association with shorter cell-free DNA fragment sizes as described herein.

Example Study 2

In this study, a high-throughput-capable automated gel-extraction platform was optimized and implemented to isolate subfractions of the mononucleosomal peak in sequencing libraries of ccfDNA from patients with melanoma, colorectal adenocarcinoma, and pancreatic ductal adenocarcinoma with confirmed somatic BRAF or KRAS variants. The study sought to determine if selection of shorter ccfDNA fragments increased VAF of tumor-associated variants as detected by ddPCR and NGS. Also studied were the NGS data to identify the potential effects of size selection to reduce ccfDNA sample complexity for generating more PCR duplicates (i.e., larger family sizes). The effects of incrementally larger family sizes on the occurrence of false positives in the sequencing libraries from healthy controls were also investigated. Additionally, the patient-derived NGS libraries were analyzed to determine whether true VAF remained constant over a wide range of family sizes. In so doing, this study characterizes the potential of automated size-based selection of ccfDNA fractions to improve detection of ctDNA during NGS applications by simultaneously enriching for ctDNA and reducing false positives associated with PCR and sequencing errors.

The results of Study 2 support the automated selection of shorter ccfDNA fragments as a multifactorial approach to improve very low frequency ctDNA detection using NGS. Building upon the findings from lung cancer in Study 1, Study 2 used both ddPCR and NGS to extend to melanoma, colorectal adenocarcinoma, and pancreatic ductal adenocarcinoma the strengths of size selecting for short ccfDNA fragments to enrich for ctDNA. Furthermore, a high-throughput-capable automated size selection technology was implemented that substantially improves the potential to translate these findings to broader research and clinical applications. Evidence was found that selection of short ccfDNA fragments enriches for ctDNA through isolation of fragment sizes containing a greater proportion of variant alleles, while concomitantly reducing wild type alleles that are more abundant at longer fragment lengths. Thus, for a given read depth the detection of ctDNA is more likely in the short ccfDNA fraction where the VAF is greatest. Finally, reduction of sample complexity through a priori size-based selection of ccfDNA generated larger family sizes for the subsequent in silico suppression of PCR and sequencing errors occurring at very low frequency. Size selection improved error correction through generation of larger family sizes without adversely affecting variant detection in the short ccfDNA fraction. Collectively, these findings identify the isolation of shorter ccfDNA fragments as a methodology to simultaneously improve both sensitivity and specificity of very low frequency ctDNA detection via NGS.

Investigation of ccfDNA size distribution originated in studies of maternal and fetal DNA in the plasma of pregnant women. As later confirmed for ctDNA, fetus-derived ccfDNA showed an increased occurrence of short fragments when compared to maternal ccfDNA with the predominant peak fraction at ~143-146 bp and a noted absence of dinucleosomal units. With a goal of increasing sensitivity of non-invasive prenatal testing, several studies were successful in moderately enriching for the fetal fraction of circulating DNA through preparative size separation by gel-electrophoresis or microsystem. However, the latter methodologies limited size selection to fragments <300 bp, effectively eliminating dinucleosomal and larger plasma DNA without achieving enrichment for the 143 bp fetal peak fraction over the 166 bp maternal component. For enrichment of ctDNA, we have previously used polyacrylamide gel electrophoresis to provide high-resolution manual extraction of targeted ccfDNA fractions. Although this approach enriched for ctDNA in a small cohort of patients with lung cancer, the methodology was cumbersome, which limited scalability and broader use. In Study 2, it is demonstrated that high-resolution size-based fractionation to isolate sub-nucleosomal populations is technically feasible using a high-throughput-capable automated gel-extraction platform. Doing so extended the previous observations that selection of short ccfDNA fragments enriches for ctDNA in a broader array of cancer types, including at least melanoma, colorectal adenocarcinoma, and pancreatic ductal adenocarcinoma. Thus, a priori size selection of specific ccfDNA fractions may have greater translational clinical implications for fully harnessing the informational power of ccfDNA size differences in prenatal and cancer diagnostics.

In silico analysis is an alternative strategy to the a priori physical selection of shorter ccfDNA fragments. The incorporation of in silico size analyses of maternal plasma DNA content has facilitated identification of fetal content and diagnosis of fetal aneuploidies. More recently, ccfDNA fragment size has been integrated into in silico filtering algorithms to significantly improve the positive predictive value of second-generation prenatal fetal whole genome analysis. However, in silico size selection may not enable the potentially advantageous variant allele enrichment afforded by a priori physical ccfDNA fragment size selection. In accord, the approach described herein for enrichment may have the greatest use in the search for non-metastatic solid tumors where ctDNA frequency is commonly <2%. Alternatively, use of in silico size selection in combination with physical size selection may further improve ctDNA detection through elimination of the longer fragments observed to migrate with the shorter targeted range. Isolation of short ccfDNA fractions did not adversely affect VAF by NGS. Although a greater variance in the calculated gain/loss factor was observed when VAF was determined by ddPCR, this finding was most pronounced for low VAF samples and may be attributable to the amount of library sampled. The ddPCR input of 50 ng comprises only ~1.5-2.5% of the total amount of library. Sampling errors can lead to exaggerated gain/loss results, particularly in low VAF samples where small numbers of under- or over-sampled copies can have dramatic effects on VAF. In contrast, 500 ng (15-25% of total library) were used for hybrid capture and subsequent NGS, likely leading to a more robust estimation of VAF gain/loss factor achieved in each size fraction.

The use of unique molecular identifiers improves ctDNA specificity. Although duplex molecular barcoding (the assignment of unique molecular identifiers to both strands of DNA) has a very high theoretical potential to reduce sequencing and PCR errors, the associated low ligation efficiency (10-20%) has limited applications seeking to detect very low frequency ctDNA variants due to sample loss. As an alternative, unique molecular identifier performance has been enhanced with error modeling derived from healthy control data, substantially reducing false positives, particularly during searches of known variants. Error modeling achieves error reduction comparable to using a barcoded family size ≥5 alone. Error modeling may be advantageous as generation of large family sizes has been previously described to lead to a similar loss in sample as duplex molecular barcoding. It is shown in Study 2 that a priori physical size selection of ccfDNA generated larger family sizes and selection for the short ccfDNA fraction enabled continued successful detection of known variants with a VAF ≥0.39% at a family size ≥20 and an average read depth of ~516×. However, using the same sequencing parameters in future investigations may not allow detection at lower VAFs due to the progressive reduction in read depth associated with increments in family size. This was evident in the unselected ccfDNA where the lower VAF, increased sample complexity, and reduced read depth at larger family sizes led to loss of variant detection. Thus, selecting a sufficient read depth for a targeted VAF within the context of using family size data for in silico error correction can overcome this potential issue. It is noteworthy that the analysis in Study 2 found persistence of stochastic errors even at the largest family sizes with a frequency in the range consistent with very low frequency ctDNA variants (0.1%<VAF<1%). As such, error modeling alone may not completely eliminate false positives during untargeted searches of ctDNA. Collectively, these observations support the conjecture that unique molecular identifiers, a priori physical size selection of short ccfDNA fragments, and sufficient read depth will improve variant detection in early-stage non-metastatic solid tumors or low-frequency aggressive clones in advanced cancers not only through ctDNA enrichment, but also by improving in silico error correction through production of larger family sizes.
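The relationship between read depth and a targeted VAF noted above can be made concrete with a short calculation. The sketch below is not part of the study's analysis: it assumes, as a simplification, that each consensus read at a locus independently carries the variant with probability equal to the VAF (a binomial model), and the function names are hypothetical. The depth of 516 and VAF of 0.39% are the figures reported above.

```python
from math import comb

def expected_variant_reads(depth, vaf):
    """Expected number of variant-supporting consensus reads at a locus."""
    return depth * vaf

def prob_at_least(depth, vaf, k):
    """P(X >= k) for X ~ Binomial(depth, vaf).

    Simplifying assumption: consensus reads are independent draws.
    """
    below = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i) for i in range(k)
    )
    return 1.0 - below

# Figures reported above: ~516x consensus depth at family size >= 20, VAF 0.39%.
expected = expected_variant_reads(516, 0.0039)   # about 2 variant reads
p_one_or_more = prob_at_least(516, 0.0039, 1)

# The same depth at a lower, hypothetical VAF of 0.1%.
p_lower_vaf = prob_at_least(516, 0.001, 1)
```

Under this model a depth of ~516× yields only about two expected variant reads at a VAF of 0.39%, and the chance of seeing any variant read drops further at lower VAFs, which is consistent with the caution above about reusing the same sequencing parameters.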

In addition to size selection of ccfDNA, alternative methods may also improve sensitivity and specificity of ctDNA detection at various steps in the process of generating NGS data. Using a reduced amount of input ccfDNA at library preparation potentially reduces sample complexity and improves generation of larger family sizes. The range of input ccfDNA from patients in this study was 10-56.6 ng (mean: 20.1±14.5 ng) derived from the greater of 10 ng or 1 mL of plasma ccfDNA equivalent. Although variants down to a VAF of 0.39% at a mean read depth of ~5,500× at FS≥1 were identified, using less input material may adversely affect sensitivity, particularly for detection of variants with an even lower VAF where using more starting material may be advantageous to minimize type II error due to sampling. Increasing the read depth has the potential to achieve both larger family sizes and greater sensitivity. In this study, a similar number of total reads was used across all samples to evaluate the effects of selecting for subfractions of ccfDNA. Using a larger number of total reads may not achieve a uniform increase in sensitivity and family size between samples as the effect would be less in samples of greater complexity. Thus, evaluating sensitivity and family size within the context of varying sample complexity can be used to determine optimal read depth for a desired VAF. In silico analysis using more stringent criteria (e.g., fraction of bases supporting the consensus call, higher quality scores, etc.) may also be an effective approach to reduce false positives. For example, in the present study an alignment score of MQ≥20 and base score Q≥20 with >0.66 concordance between bases was used during consensus identification. Increasing these values may improve specificity, but at the risk of adverse effects on sensitivity. Additionally, PCR-based methods using molecular barcodes represent an alternative approach to the capture-based NGS methods used in this study. While amplicon-based sequencing panels can be useful for querying a small number of mutational hotspots, they rely on consistent amplification of all target sequences in a multiplex PCR step. Therefore, the ability to customize or expand panels by addition of primer pairs is limited. In addition, due to the highly fragmented nature of ccfDNA, amplicon-based approaches can only account for ccfDNA molecules containing intact amplicons, while hybridization probes potentially capture additional unique ccfDNA molecules. Also, hybridization capture sequencing panels can range from small, targeted panels to whole exome or whole genome coverage and size selection has the potential to benefit ctDNA detection using any size panel. Overall, each approach has strengths and weaknesses. A balance between cost, sensitivity, and specificity needs strong consideration when designing a study and determining utility of each method or combination of methods. In this study, automated size-based selection of ccfDNA was found to support both sensitivity and specificity. However, costs associated with labor, equipment, and reagents merit strong consideration prior to integration of size selection for ccfDNA subfractions into a research and clinical laboratory workflow.
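The consensus criteria mentioned above (MQ≥20, Q≥20, and >0.66 concordance between bases) can be illustrated with a per-position sketch. This is one possible reading of those thresholds, not the pipeline used in the study: the function name, the tuple layout of an observation, and the use of "N" as a no-call are all assumptions made for illustration.

```python
from collections import Counter

MIN_MAPQ = 20           # alignment (mapping) quality threshold, MQ >= 20
MIN_BASEQ = 20          # base quality threshold, Q >= 20
MIN_CONCORDANCE = 0.66  # fraction of passing bases that must agree (strictly greater)

def consensus_base(observations):
    """Call one consensus base for one position within one UMI family.

    observations: list of (base, base_quality, mapping_quality) tuples.
    Returns the majority base, or "N" (a hypothetical no-call symbol) when
    no observation passes the quality filters or concordance is too low.
    """
    passing = [
        base
        for base, base_q, map_q in observations
        if base_q >= MIN_BASEQ and map_q >= MIN_MAPQ
    ]
    if not passing:
        return "N"
    base, count = Counter(passing).most_common(1)[0]
    return base if count / len(passing) > MIN_CONCORDANCE else "N"

# Two of three passing reads agree: 0.667 > 0.66, so "A" is called.
call_majority = consensus_base([("A", 30, 60), ("A", 35, 60), ("T", 32, 60)])
# An even split fails the concordance requirement.
call_split = consensus_base([("A", 30, 60), ("T", 30, 60)])
# Low-quality observations are discarded before voting.
call_filtered = consensus_base([("A", 30, 60), ("T", 10, 60), ("T", 30, 5)])
```

Raising any of the three constants makes a no-call more likely, which is the specificity versus sensitivity trade-off described above.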

Preparation of libraries for physical size selection and NGS is accomplished using a multi-step process with multiple rounds of PCR amplification. A direct comparison of ctDNA VAFs determined by ddPCR in ccfDNA and by multi-step NGS in captured ccfDNA libraries indicated that detectable VAFs were not adversely affected by the NGS methodology used in this study. Although a reduced association for VAF<1.5% by ddPCR was not observed, this may be attributable to sample size as both the highest (1.31%) and lowest (0.39%) VAF by ddPCR within this subset were increased in the NGS data (2.3% and 0.78%, respectively). A similar comparison of VAFs by sequencing of ccfDNA libraries and by direct ddPCR of the corresponding plasma DNA has been reported. While such also demonstrated a high correlation of NGS and ddPCR, VAFs in NGS were generally lower (~2×) than those detected directly by ddPCR. A similar drift was not observed in VAF. This suggests that conversion of ctDNA and non-tumor ccfDNA fragments into NGS libraries may be biased against ctDNA in certain methods of library preparation. Evidence that such bias could at least in part be accounted for by the size difference in non-tumor versus tumor-derived fragments is provided by a previous study which demonstrated that the choice of library preparation method directly influences representation of shorter or damaged ccfDNA molecules. In the present study, however, evidence of bias against ctDNA during NGS library preparation was not observed. Rather, it was found that both WT and variant counts observed by NGS were similar to expected NGS counts using ddPCR data as a reference. The similarities in count number between ddPCR and NGS suggest losses are associated with both approaches. Thus, methods that reduce loss with either technique may further improve overall sensitivity.

Example of Size Selection for Short ccfDNA Fragments Enriches for ctDNA

VAFs in ccfDNA Determined by ddPCR and NGS are Strongly Correlated

High-resolution size selection of ccfDNA libraries prior to sequencing involves multiple PCR amplification steps, as is shown in the example of FIG. 6A. It was initially determined whether the library preparation process or subsequent PCR amplification steps result in drift of the detectable VAF. Samples from 13 patients with a BRAF or KRAS variant present in solid tumor tissue from melanoma (N=8), colorectal adenocarcinoma (N=3), or pancreatic ductal adenocarcinoma (N=2) and a corresponding quantifiable variant present in ccfDNA by droplet digital PCR (ddPCR) were analyzed (Table 1). ddPCR performed on ccfDNA prior to ligation of adapters and library formation determined VAF (FIG. 6A) and facilitated direct molecular counting of amplifiable unique wild type (WT) and variant counts (FIGS. 7A-E). After addition of truncated adapters with unique molecular identifiers, subsequent extension to full-length adapters, panel capture, and multiple PCR amplification steps, WT counts, variant counts, and VAF were determined by NGS (FIG. 6A). Each NGS count (either WT or variant) for all reported results hereafter is an aligned consensus read derived from PCR duplicates with the same unique molecular identifier. Thus, each count represents a single unique molecule from the original ccfDNA sample prior to library PCR amplification. To evaluate for losses associated with NGS, WT and variant counts obtained from ddPCR were extrapolated to determine the expected number of WT and variant counts from NGS assuming a lossless system for a given amount of ccfDNA library input. There was a significant correlation between the extrapolated ddPCR counts and NGS counts for WT alleles (Pearson's r=0.72, P=0.005) and variant alleles (r=0.96, P<0.001). For both WT and variant counts an increased number of counts was detected by NGS over ddPCR for most of the samples. That difference was statistically significant for the WT alleles (57.5±86.7%, P=0.034; FIG. 8A), but not variant alleles (77.1±162.9%, P=0.11; FIG. 8B). Undercalling of ddPCR is likely a reflection of the requirement of detectable DNA fragments to contain intact amplicon regions. Such non-amplifiable alleles may be detectable by NGS after hybrid capture. Conversely, loss of allele counts detectable by NGS could be a consequence of inefficient adapter ligation, post ligation cleanup, or non-uniform hybrid capture. While both methods may be subject to losses and regardless of absolute count differences, the VAF was strongly correlated between ddPCR and NGS (r=0.97, P<0.001; FIG. 6B). In the subset of samples with VAF<1.5% by ddPCR the association persisted (r=0.68, P=0.046; FIG. 6B, inset). Thus, sample preparation and analysis by NGS used in this study did not adversely affect VAF in ccfDNA as corroborated by ddPCR.

TABLE 1 Demographics and variant allele frequency (VAF).

ID   Cancer Type   Age, yrs   Gender   Stage   Allele        VAF by ddPCR, %
C1   Colorectal    69         M        IV      KRAS G13D     12.26
C2   Colorectal    78         F        III     BRAF V600E    2.43
C3   Colorectal    63         M        IV      KRAS G12D     0.65
M1   Melanoma      49         F        IIIC    BRAF V600E    0.81
M2   Melanoma      80         M        IIIC    BRAF V600K    1.11
M3   Melanoma      53         F        IV      BRAF V600E    0.39
M4   Melanoma      45         M        III     BRAF V600E    1.31
M5   Melanoma      67         F        IV      BRAF V600E    3.74
M6   Melanoma      67         M        IV      BRAF V600K    0.88
M7   Melanoma      39         M        IV      BRAF V600E    5.78
M8   Melanoma      50         M        IV      BRAF V600K    1.10
P1   Pancreatic    63         M        IV      KRAS G12D     0.48
P2   Pancreatic    54         F        IV      KRAS G12V     0.43

FIG. 6A shows a process flow depicting steps prior to determination of variant allele frequency (VAF). With known variants, VAF can be determined directly from ccfDNA with ddPCR (flow a), while sequencing requires a multi-step process (flow b). The addition of truncated adapters followed by extension to full-length in separate steps (flows b-e) is done to improve resolution during size selection (flows c, d) of desired subfractions of ccfDNA. There was a strong association (FIG. 6B) between direct measurement of VAF in ccfDNA by ddPCR (FIG. 6A, flow a) and by the multi-step sequencing process (FIG. 6A, flow b). This association was present even at VAFs<1.5% (FIG. 6B inset). The equations for each colored regression line are shown in a corresponding color. In FIG. 6C, boxplots of wild type alleles and variant alleles by NGS are shown for each cancer patient (C=colorectal adenocarcinoma; M=melanoma; P=pancreatic ductal adenocarcinoma). In FIG. 6C, data are only shown for insert sizes ≤250 bp to focus results on the mononucleosome, as that length approximates the midpoint between the mononucleosome and dinucleosome lengths associated with ccfDNA. The horizontal line identifies the median insert size (167 bp) from all patients. In the majority of patients, the median insert size of the tumor-associated variant allele was shorter than the corresponding wild type allele.

FIGS. 7A-E show detection of variant alleles by ddPCR in ccfDNA. Plasma ccfDNA was isolated from 13 cancer patients with confirmed solid tumor variants in BRAF or KRAS (Table 1). Between 7 and 46 ng of ccfDNA was analyzed, depending on the concentration of cell-free DNA in the plasma. Positive control samples were generated from commercial standards (Horizon Discovery; HD701: BRAF V600E, KRAS G13D; HD239: BRAF V600K; HD272: KRAS G12D; HD289: KRAS G12V) and sheared to mimic cell-free DNA size distribution. Cell-free DNA extracted from the plasma of healthy controls and water were included as wildtype-only and no-template assay controls, respectively. Primary ddPCR data plots and gated areas generated by the RD Analyst software are shown. Gates for wildtype and variant droplet clusters were set using positive control samples and subsequently applied to negative controls and patient samples. The shown variant allele frequency was calculated from the observed droplet counts. Variant copy number per milliliter plasma was extrapolated based on effective analyzed reaction volume, DNA input volume, total extract volume and total plasma volume extracted.
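The droplet-count calculations described above can be sketched as follows. The exact formulas used by the analysis software are not given here, so this sketch makes two labeled assumptions: copies are estimated from droplet counts with the standard Poisson correction for digital PCR, and the "effective analyzed reaction volume" is expressed as a hypothetical fraction of the reaction that was read. All function names, parameter names, and numbers are illustrative only.

```python
from math import log

def poisson_copies(positive_droplets, total_droplets):
    """Estimate target copies in the analyzed droplets.

    Assumption: standard digital PCR Poisson correction,
    copies = -ln(1 - positive/total) * total.
    """
    if not 0 <= positive_droplets < total_droplets:
        raise ValueError("positive droplets must be >= 0 and < total droplets")
    return -log(1 - positive_droplets / total_droplets) * total_droplets

def vaf(variant_copies, wildtype_copies):
    """Variant allele frequency as variant / (variant + wild type)."""
    return variant_copies / (variant_copies + wildtype_copies)

def copies_per_ml_plasma(copies, analyzed_fraction, input_ul, extract_ul, plasma_ml):
    """Scale analyzed copies up to copies per mL of plasma.

    analyzed_fraction: hypothetical fraction of the reaction volume read.
    input_ul: volume of DNA extract added to the reaction.
    extract_ul: total extract volume; plasma_ml: plasma volume extracted.
    """
    copies_in_reaction = copies / analyzed_fraction
    copies_per_ul_extract = copies_in_reaction / input_ul
    return copies_per_ul_extract * extract_ul / plasma_ml

# Hypothetical droplet counts, not data from the study.
variant = poisson_copies(100, 10_000)
wildtype = poisson_copies(5_000, 10_000)
sample_vaf = vaf(variant, wildtype)
variant_per_ml = copies_per_ml_plasma(
    variant, analyzed_fraction=0.5, input_ul=8, extract_ul=50, plasma_ml=2
)
```

The Poisson correction matters most for the abundant wild type target, where many droplets contain more than one copy; without it the wild type count would be underestimated and the VAF overestimated.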

FIGS. 8A-B show wild type (WT) and variant allele (VA) counts by ddPCR and NGS. The ddPCR WT (FIG. 8A, x-axis) and VA (FIG. 8B, x-axis) counts were derived from ddPCR data and represent the extrapolated number of counts expected to be seen by NGS assuming a lossless system for a given amount of ccfDNA library input. In both FIGS. 8A and 8B, the solid line is the line of unity. In FIG. 8B, the inset is a magnification identified by the box. The legend identifies counts associated with each variant.

Automated Size Selection Reproducibly Isolated ccfDNA Subfractions

In EGFR-mutant lung cancer samples, VAF increases in association with shorter ccfDNA fragments. In 13 patients it was observed that the median insert sizes corresponding to the mononucleosome (<250 bp) associated with the variant alleles present in either BRAF or KRAS were generally shorter than the corresponding wild type alleles (151.8±12.8 vs. 166.9±2.5 bp, respectively; P=0.001; FIG. 6C). Individual beeswarm plots for each patient for both WT and variant are shown in FIG. 9 (insert size<250 bp, mononucleosome) and FIG. 10 (insert size>250 bp, dinucleosome and larger). In order to enrich for variant alleles, an automated agarose gel-based extraction method was optimized and tested for the consistent selection of size-based ccfDNA fractions (FIGS. 11A-B). Subsequently, enrichment capabilities of gel-based ccfDNA fragment size selection were established using pooled ccfDNA from healthy controls spiked with synthetic EGFR T790M fragments (length 130 bp) and BRAF V600E fragments (length 165 bp) at similar VAF. Truncated adapters were added and the products were PCR amplified to produce libraries of spiked and unspiked ccfDNA, which were then mixed to yield an eight-sample dilution series of unselected spiked ccfDNA libraries with VAFs ranging from 0.01% to 13.1% as measured by ddPCR (FIGS. 12A-D). Short and long fractions were then extracted from 1 μg of PCR-amplified truncated ccfDNA libraries to target the EGFR and BRAF spike-in variants, respectively. Full-length libraries were generated from size-selected fractions and VAF was determined by ddPCR (FIGS. 12A-D). Isolation of the short fraction increased VAF of the EGFR T790M variant (130 bp) in each dilution (FIG. 11C). There was a strong association between dilution factor and enrichment (Pearson's r=0.92, P=0.009; FIG. 11D) indicating greatest enrichment occurred at the lowest VAF. The EGFR T790M variant was absent in the long fraction (FIG. 11C). The VAF of the BRAF V600E variant (165 bp) was consistently greater than the unselected VAF in both the short and long fractions (FIG. 11E), but the extent of enrichment remained relatively constant across all dilutions (FIG. 11F). As such, the enrichment observed for the 165-bp BRAF V600E variant was likely due to elimination of wildtype BRAF found in the dinucleosomal and larger plasma DNA components. Collectively, these findings characterize electrophoretic mobility of ccfDNA under the prescribed experimental conditions. Specifically, for a given size of ccfDNA the distribution is not fully Gaussian. For a targeted size range, longer rather than shorter fragments outside the desired range are more likely to be present, as exemplified by the absence of the EGFR T790M variant (130 bp) in the long fraction and the presence of the BRAF V600E variant (165 bp) in the short fraction. This is further shown in the densitometry plots where a tail of longer fragment sizes is present in the short fraction (FIG. 11A, arrow). These results also support the automated agarose gel-based extraction method for reproducibly and accurately separating subpopulations of the mononucleosomal ccfDNA component after NGS library preparation.

FIG. 9 shows beeswarm plots for insert sizes<250 bp associated with the wild type (WT) and variant allele (VA) from each patient. The solid gray line corresponds to the overall median insert size from all patients of 167 bp. The solid light or dark blue line for WT or VA identifies the corresponding median insert size for that patient. In some instances it is not visible (e.g., M1) as it is behind the gray line. The identifiers under each plot are matched to Table 1 and FIG. 2C. C=colorectal adenocarcinoma; M=melanoma; P=pancreatic ductal adenocarcinoma.

FIG. 10 shows beeswarm plots for insert sizes>250 bp associated with the wild type (WT) and variant allele (VA) from each patient. Absence of data (e.g., C1) indicates that an allele with an insert size>250 bp was not detected for that patient. The solid black line for WT or VA identifies the median insert size. The dark blue and light blue numbers identify the total number of counts for WT and VA, respectively. The identifiers under each plot are matched to Table 1 and FIG. 2C. The percentages under each identifier indicate VAF using insert sizes>250 bp. C=colorectal adenocarcinoma; M=melanoma; P=pancreatic ductal adenocarcinoma.

FIGS. 11A-F show the effect of size selection on VAF in spiked ccfDNA libraries. Targeted ccfDNA fractions were isolated using a high-throughput automated gel-extraction platform. Distribution by densitometry of the short (purple) and long (orange) fractions isolated from healthy control unselected ccfDNA samples (black; N=7) is shown in FIG. 11A. Size includes full-length adapters (~135 bp). Note the evidence of a tail in the short fraction (blue arrow) consistent with longer fragments migrating with a shorter target fragment size. Although the overall distributions overlapped, the peak fragment length of the short fraction was significantly less than the long fraction (FIG. 11B). No significant difference was measured between the peak fragment lengths of the long fraction and the unselected mononucleosome (FIG. 11B). Gray numbers indicate the mean±SD peak fragment length for each sample (FIG. 11B). VAF determined by ddPCR for the EGFR T790M synthetic spike (130 bp) for the short (purple) and long (orange) fractions and the unselected ccfDNA (black) are graphed in FIG. 11C. In the short fraction the T790M allele remained detectable even when it was undetectable in unselected ccfDNA (FIG. 11C, inset), while it was virtually absent from the long fraction regardless of dilution. The enrichment factor in the ‘small’ fraction was associated with extent of dilution with the greatest amount of enrichment occurring in the most diluted samples (FIG. 11D; data only shown when unselected ccfDNA VAF was above the limit of blank by ddPCR). VAF by ddPCR for the BRAF V600E synthetic spike (165 bp) is shown in FIG. 11E. Overall, there was a general trend towards enrichment in both the short (FIG. 11E, purple) and long (FIG. 11E, orange) fractions. The variant was present throughout the short samples except at the lowest dilutions (FIG. 11E, inset). Extent of enrichment was relatively consistent regardless of dilution (FIG. 11F). In FIGS. 11A-D, error bars indicate standard deviation from independent duplicate experiments. *** P<0.001; NS=not significant; AFU=arbitrary fluorescent unit.

FIGS. 12A-D show VAFs detected by ddPCR of size-selected and unselected synthetically spiked ccfDNA libraries. Pooled normal ccfDNA was spiked with 130-bp EGFR T790M and 165-bp BRAF V600E synthetic gBlocks® and truncated libraries were prepared from the spiked sample and its unspiked reference pool. After creation of an eight-step dilution series of spiked libraries with unspiked controls, the spiked libraries and unspiked reference were size selected. Full-length libraries were subsequently prepared from unselected samples and short and long gel fractions. VAF for 130-bp EGFR T790M and 165-bp BRAF V600E was detected by ddPCR using 50 ng of full-length library.

Automated Selection of Shorter ccfDNA Fragments Increased VAF

Size-based fractions were then isolated from the ccfDNA of patients with solid tumors (Table 1) to characterize the effects on VAF as quantified by both ddPCR and NGS. Both short and long fractions were extracted from PCR-amplified truncated ccfDNA libraries (1 μg; FIG. 13A). In a second independent run of PCR-amplified truncated ccfDNA libraries (1 μg), an intermediate fraction (i.e., medium) targeted between the short and long fractions was also isolated (FIG. 13A). The distribution of sequencing-derived insert sizes from all short, medium, and long fractions, and unselected libraries is shown in FIG. 13B, which demonstrates a substantial amount of overlap between fractions. However, there was a statistically significant difference between fractions for both the peak fragment length by densitometry (F(3,48)=99.4, P<0.001; FIG. 13C) and the median insert size by NGS (F(3,48)=283.9, P<0.001; FIG. 13D) indicating distinct subfractions of ccfDNA were isolated from the original mononucleosome distribution of fragment sizes. There was a significant difference in the change of VAF between sub-fractions relative to the unselected library as determined by ddPCR (F(2,36)=5.4, P=0.009; FIG. 13E and FIG. 14) and sequencing (F(2,36)=17.7, P<0.001; FIG. 13F). A significantly larger increase was present in VAF for the short fractions compared to the long fractions by both ddPCR (2.9±2.6 vs. 0.8±0.5 fold-change, respectively; P=0.006) and NGS (2.0±0.8 vs. 0.7±0.2 fold-change, respectively; P<0.001). In the NGS data, there was also an increase in VAF from the short fractions compared to the medium fractions (2.0±0.8 vs. 1.3±0.5 fold-change, respectively; P=0.015) and the medium fractions compared to the long fractions (1.3±0.5 vs. 0.7±0.2 fold-change, respectively; P=0.013). Thus, selection of shorter ccfDNA fragments increased VAF and exclusion of longer ccfDNA fragments did not adversely affect VAF.

FIGS. 13A-F show enrichment of variant alleles in short ccfDNA fractions. In FIG. 13A, representative distributions by densitometry are shown of the isolated fractions (short: purple; medium: green; long: orange) from the original ccfDNA (black) of a single cancer patient. The fragment lengths include full-length adapters (~135 bp). The cumulative distribution of insert sizes at variant locations from all patients for each sub-fraction (FIG. 13B) shows a profile consistent with densitometry (FIG. 13A). The peak fragment lengths from each patient by densitometry (FIG. 13C) and the median insert size by sequencing (FIG. 13D) were statistically significantly different between each respective sub-fraction, while observations for the long fraction were similar to the unselected mononucleosome (black). Enrichment for variant alleles was greatest in the short fraction by both ddPCR (FIG. 13E) and sequencing (FIG. 13F) with intermediate enrichment in the medium fraction (FIG. 13F). In the long fraction analyzed by both modalities there was a tendency for reduction in VAF (FIG. 13E and FIG. 13F). Solid bars represent the mean value. In FIGS. 13C-E, mean±SD values are shown in gray. * P<0.05; ** P<0.01; *** P<0.001; NS=not significant; AFU=arbitrary fluorescent unit.

FIGS. 14A-E show VAF by ddPCR in size-selected ccfDNA libraries. Full-length libraries were prepared from unselected samples and short, medium, and long fractions. VAF of known variant was determined by ddPCR from 50 ng of library. Primary ddPCR data plots and gated areas generated by the RD Analyst software are shown. Gates for wildtype and variant droplet clusters were set on the unselected sample and applied to patient-matched size selected samples.

The potential source of enrichment was explored using NGS data as each WT or variant allele count was derived from a consensus read representing a unique ccfDNA molecule. WT and variant counts were used to determine the percent difference for each fraction relative to the unselected ccfDNA WT and variant counts, in order to identify effects of size selection on gain/loss of counts. Between ccfDNA subfractions there was a significant difference in WT counts (F(2,36)=42.6, P<0.001; FIG. 15A). The short ccfDNA fraction demonstrated the greatest reduction of WT counts at a mean of 48.2±17.1%. In contrast, WT counts in the long ccfDNA fraction were relatively unchanged at an increase of 1.3±9.1%. For variant counts, there was also a significant difference (F(2,36)=6.9, P=0.003; FIG. 15B) between subfractions largely due to the 28.6±21.4% reduction in variant counts present in the long ccfDNA fraction. The variant counts in the short ccfDNA fraction were relatively unchanged at an increase of 0.3±37.3% indicating loss of few variants during size selection. Of note, we also observed within each subfraction of ccfDNA a tendency for the variant alleles to have shorter insert sizes and a broader distribution compared to WT alleles (FIG. 16). These findings in combination with earlier observations from unselected ccfDNA (FIG. 6C) support a greater proportion of ctDNA at shorter ccfDNA fragment lengths. Thus, isolation of short ccfDNA fragments enriched for ctDNA through reduction of WT alleles without compromising variant allele detection.
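
The arithmetic behind this reasoning can be sketched as follows. This is an illustrative example only: the function names are hypothetical and the counts used in the usage note are invented round numbers, not study data. It shows how a reduction in WT counts with unchanged variant counts raises VAF.

```python
def percent_difference(fraction_count: float, unselected_count: float) -> float:
    """Percent gain (+) or loss (-) of counts relative to unselected ccfDNA."""
    if unselected_count == 0:
        raise ValueError("unselected count must be non-zero")
    return 100.0 * (fraction_count - unselected_count) / unselected_count


def vaf(variant_count: float, wt_count: float) -> float:
    """Variant allele frequency from consensus (unique-molecule) counts."""
    return variant_count / (variant_count + wt_count)


def vaf_fold_change(
    variant_fraction: float,
    wt_fraction: float,
    variant_unselected: float,
    wt_unselected: float,
) -> float:
    """Fold-change in VAF of a size-selected fraction versus unselected."""
    return vaf(variant_fraction, wt_fraction) / vaf(variant_unselected, wt_unselected)
```

With hypothetical unselected counts of 1,000 WT and 10 variant molecules, a short fraction retaining 518 WT and all 10 variant molecules has a WT percent difference of -48.2% and a VAF fold-change of roughly 1.9.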

FIGS. 15A-B show the percent difference in wild type (WT) and variant counts for each ccfDNA fraction relative to unselected ccfDNA counts. Compared to WT counts in unselected ccfDNA, there was a significant reduction in the short ccfDNA fraction compared to the medium and long ccfDNA fractions (FIG. 15A). For the variant counts (FIG. 15B), there was a significant reduction in the long ccfDNA fraction compared to the medium ccfDNA fraction and a strong trend to have fewer counts than the short ccfDNA fraction. *P<0.05, **P<0.01, ***P<0.001.

FIG. 16 shows median insert size for the wild type (WT) and variant allele (VA) for each ccfDNA fraction. Within each sub-fraction of the mononucleosome, there was evidence that the VA was shorter and had a broader distribution of insert sizes than the WT allele.

Automated Size Selection of ccfDNA Fragments Generated Larger Family Sizes

The effects of a priori physical size selection on read depth and family size were next addressed. For all locations there was a statistically significant difference in total reads amongst all sample types (F(4,60)=6.4, P<0.001), which was solely attributable to a modest increase in the long ccfDNA fraction (FIG. 17A, FIG. 18A). This is an important starting point as the similarity in total reads between different samples indicates the subsequent findings are not due to experimental bias. A statistically significant difference in consensus aligned reads between groups was identified (F(4,60)=38.0, P<0.001). Buffy-coat DNA had the greatest number of consensus aligned reads, while short ccfDNA had the fewest (FIG. 17B, FIG. 18B). The on-target fraction was significantly different between groups (F(4,60)=6.4, P<0.001) largely due to a slight lowering of the on-target fraction in short ccfDNA (FIG. 18C). Average family size was also significantly different between sample types (F(4,60)=20.1, P<0.001). Average family size was largest in the short ccfDNA fraction and smallest in buffy coat DNA (FIG. 17C). The family sizes for medium and long ccfDNA fractions were significantly larger than buffy coat DNA (FIG. 18D). As such, the reduction in sample complexity through isolation of ccfDNA mononucleosome fractions yielded larger average family sizes for a similar number of total reads even in the context of a reduced on-target fraction in the short ccfDNA fraction. This effect was then evaluated at the known variant locations. Although buffy coat DNA had the largest consensus read depth at family size≥1, there was a rapid reduction with increasingly larger family sizes (FIG. 17D). Consensus read depth decayed more slowly for unselected ccfDNA and short ccfDNA (FIG. 17D). At family size≥20, there was a statistically significant difference in consensus read depth between sample types at the variant locations (F(4,60)=24.8, P<0.001; FIG. 18E). The short, medium, and long ccfDNA fractions demonstrated the greatest consensus read depth at family size≥20 (FIG. 17E; FIG. 18E).
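
The relationship between sample complexity, average family size, and consensus read depth at a family-size threshold can be illustrated with a minimal sketch. The function names are hypothetical, and each list element is assumed to be the number of raw reads sharing one UMI family at a locus.

```python
def average_family_size(family_sizes: list[int]) -> float:
    """Mean number of raw reads per UMI family."""
    if not family_sizes:
        raise ValueError("no families")
    return sum(family_sizes) / len(family_sizes)


def consensus_read_depth(family_sizes: list[int], min_family_size: int) -> int:
    """Number of consensus reads kept when requiring family size >= threshold."""
    return sum(1 for size in family_sizes if size >= min_family_size)


# Two hypothetical loci, each covered by 100 raw reads in total.
high_complexity = [1] * 100   # 100 unique molecules, each read once
low_complexity = [25] * 4     # 4 unique molecules, each read 25 times
```

In this toy case the high-complexity locus has the larger consensus read depth at family size≥1 (100 vs. 4) but none at family size≥20, whereas the low-complexity locus retains all four consensus reads at family size≥20.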

FIGS. 17A-E show the generation of large family sizes in short ccfDNA. Total reads were similar between sheared buffy coat DNA, unselected ccfDNA, and short ccfDNA (FIG. 17A). Consensus read depth (family size≥1) was greatest in buffy coat DNA, followed by unselected ccfDNA, and then short ccfDNA (FIG. 17B). Average family size was greatest in the short ccfDNA (FIG. 17C). At the specific variant locations for each patient, consensus read depth in buffy coat DNA rapidly decayed, reaching zero by family size≥20 (FIG. 17D, gray). In contrast, both the unselected ccfDNA (FIG. 17D, black) and the short ccfDNA (FIG. 17D, purple) showed fewer consensus reads at family size≥1, but maintained a greater read depth at larger family sizes (FIG. 17D, inset). Consensus read depth at family size≥20 was greatest in short ccfDNA (FIG. 17E). In FIGS. 17A-C and FIG. 17E, solid bars represent the mean value. In FIGS. 17A-E, whiskers correspond to the standard deviation. *** P<0.001; NS=not significant.

FIGS. 18A-E show the generation of family sizes in buffy coat DNA, unselected ccfDNA, short, medium and long ccfDNA fractions. Overall, total reads were similar between sample types except for the long ccfDNA fraction where there was a significant increase (FIG. 18A). Consensus read depth (family size≥1) was greatest in buffy coat DNA and least in the short ccfDNA fraction (FIG. 18B). The on-target fraction was similar across all sample types except for the short ccfDNA fraction where there was a significant decrease (FIG. 18C). Average family size was greatest in the short ccfDNA, while the family sizes in the medium and long fractions were significantly larger than the buffy coat DNA (FIG. 18D). At the specific variant locations for each patient, consensus read depth at family size≥20 was greatest in the short, medium, and long fraction (FIG. 18E). In FIGS. 18A-E, solid bars represent the mean value and whiskers correspond to the standard deviation. *** P<0.001; ** P=0.01; * P<0.05; NS=not significant.

Larger Family Sizes Reduced False Positives

The association between family size and false positives was also addressed. During a targeted search for the corresponding known variant in the buffy coat DNA from each patient, few false positives were identified (FIG. 19A). We then analyzed unselected ccfDNA from 11 healthy controls sequenced under identical conditions. As with the patient samples, the greater of 10 ng or 1 mL plasma equivalent of ccfDNA was used for the initial library input. The mean amount of ccfDNA present in the healthy controls was 11.6±4.7 ng/mL plasma (median: 11.3 ng/mL plasma; range: 3.8-15.3 ng/mL plasma). While there was a trend for a larger amount of ccfDNA per mL plasma in patients, the difference relative to the controls was not statistically significant, most likely due to the large variation in the patient data and the sample size associated with each cohort (20.1±14.5 vs. 11.6±4.7 ng/mL plasma, respectively; P=0.07). Similarly, few false positives were found for the known patient variants in healthy control ccfDNA (FIG. 19B). In both patient buffy coat DNA and control ccfDNA the allele frequency for known variants was <0.01% suggesting that constrained searches of known variants may be associated with a low error rate. Of note, we also observed that family size in the control unselected ccfDNA was significantly larger than patient unselected ccfDNA (8.4±2.7 vs. 4.9±1.2, respectively; P<0.001), which could not be explained by differences in total reads or on-target fractions (FIGS. 20A-D). This latter finding supports the supposition that reduction in sample complexity through size selection generates larger family sizes as patient-derived ccfDNA is expected to be more complex than control ccfDNA due to contributions from tumor cells, higher concentration of ccfDNA present in plasma, or both.

FIGS. 19A-F show the reduction of false positives at larger family sizes. Corresponding variants present in patient ccfDNA were queried in matched buffy coat DNA (FIG. 19A). False positives were few and incrementally decreased with larger family sizes (FIG. 19A). FIG. 19B shows the cumulative number of false positives from all healthy control ccfDNA and all five targeted patient variants. Overall, only two false positives were identified. In FIG. 19C, the mean error rate across the entire capture panel (128 genes, 128 kb) decreased with increasingly larger family sizes. Total consensus aligned counts for non-reference alleles with AF<0.1% (FIG. 19D), 0.1%≤AF≤1.0% (FIG. 19E), and 1.0%<AF≤2.0% (FIG. 19F) are shown (black circles). In FIG. 19E and FIG. 19F, non-reference alleles are sub-categorized as “unique” (blue squares) or “shared” (gray triangles). In FIG. 19F, “shared” non-reference alleles are not shown as they are similar to the total count. In FIG. 19F, the “unique” non-reference allele count is plotted on a second y-axis. In FIGS. 19C-F, whiskers correspond to the standard deviation.

FIGS. 20A-D show comparison of coverage, on-target fraction, and family size between unselected ccfDNA from healthy controls and patients. Although total reads (FIG. 20A), consensus read depth (FIG. 20B), and on-target fraction (FIG. 20C) were significantly higher in the patient cohort, the average family size was largest in the controls (FIG. 20D). In FIGS. 20A-D, solid bars represent the mean value and whiskers correspond to the standard deviation.

The aligned base error rate in the control ccfDNA was evaluated to study occurrence of false positives during untargeted searches using the 128 gene (128 kb) panel. Globally, at family size≥1 the mean error rate was 0.011±0.002% and there was a reduction in error with incrementally larger family sizes (FIG. 19C). At family size≥20 the mean error rate was significantly reduced by 57.0±7.7% (P<0.001). Although the global mean error rate provides an overall metric for quality of sequencing, the principal source of false positives during NGS detection of very low frequency variants is local errors associated with stochastic noise or position-specific common errors. Locally, non-reference allele counts in control ccfDNA similarly reduced with incrementally larger family sizes (FIGS. 19D-F). At a non-reference allele frequency≥0.1% the data were parsed into “unique” and “shared” locations. A shared location was defined as the presence of a non-reference allele in at least three control ccfDNA samples, thus unique locations were representative of stochastic noise. The majority of non-reference alleles detected at a frequency≥0.1% and ≤1.0% were due to unique rather than shared locations (FIG. 19E). Within this frequency range there was a significant reduction of 81.6±7.3% (P<0.001) in unique non-reference allele counts between family size≥1 and family size≥20. Non-reference alleles detected at a frequency>1% and ≤2.0% were relatively few and largely due to shared locations (FIG. 19F). However, it is notable that on average there were ~2 non-reference unique variants in the control ccfDNA with a frequency>1% and ≤2.0% present even at large family sizes (FIG. 19F). Combined, these findings indicate stochastic sequencing noise and/or PCR errors may confound identification of true variant alleles during untargeted searches. Regardless, the control data provide compelling evidence that generation of large family sizes improves in silico error reduction.
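
The “unique” versus “shared” classification described above can be sketched as follows. This is an illustrative example under stated assumptions, not the USeq implementation: the function name, the input layout (a mapping from sample name to the set of locations carrying a non-reference allele), and the location tuples are all hypothetical.

```python
from collections import Counter

# A location is identified here by (chromosome, position, alternate base).
Location = tuple[str, int, str]


def classify_locations(
    calls_by_sample: dict[str, set[Location]],
    min_shared_samples: int = 3,
) -> dict[Location, str]:
    """Label each non-reference location as "shared" or "unique".

    A location is "shared" when the non-reference allele is present in at
    least `min_shared_samples` control samples, and "unique" otherwise.
    """
    sample_counts: Counter = Counter()
    for locations in calls_by_sample.values():
        # Each sample contributes at most once per location.
        sample_counts.update(set(locations))
    return {
        location: "shared" if n >= min_shared_samples else "unique"
        for location, n in sample_counts.items()
    }
```

Under this definition a non-reference allele seen in only one or two controls is treated as stochastic noise, while one recurring in three or more controls is treated as a position-specific common error.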

VAF Remained Constant in Shorter ccfDNA Fractions at Larger Family Sizes

The effects of family size on VAF were investigated since larger family sizes were associated with a reduced consensus read depth. In the unselected ccfDNA, VAF remained relatively constant up to family size≥10; however, VAF subsequently became inconsistent at larger family sizes (FIGS. 21A-B). In ~46% of patients (6 of 13) the variant allele was lost before family size≥20 (FIG. 21B). In contrast, the VAF in the short ccfDNA fraction was consistent up to a family size≥20 without loss of variant detection in any patient (FIGS. 21C and 21D). The absolute value of relative percent change in VAF was similar for unselected ccfDNA and short ccfDNA at family sizes≥5 and ≥10 (FIG. 21E). The relative percent change was significantly larger in the unselected ccfDNA compared to the short ccfDNA fraction at a family size≥15 and a family size≥20 (FIG. 21E). As such, VAF remained more consistent at larger family sizes in the short ccfDNA fraction than in the unselected ccfDNA regardless of initial VAF magnitude. The medium and long ccfDNA fractions exhibited similar improvement in VAF consistency at larger family sizes relative to the unselected ccfDNA (FIGS. 22A-C and FIGS. 23A-C, respectively), which further supports the strengths of reducing sample complexity to improve sensitivity even though each fraction was associated with loss of variant detection in at least one patient by family size≥20. Thus, the continued detection of low frequency variants at large family sizes in the short ccfDNA fraction may have been supported by the combined effects of increased consensus read depth and variant enrichment for the total number of reads used in this study.
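
A minimal sketch of how VAF and its relative percent change might be tracked across family-size thresholds is given below. The function names and the input layout (one `(allele, family_size)` pair per consensus read at the variant location) are assumptions made for illustration, and taking VAF at family size≥1 as the reference is likewise an assumption.

```python
from typing import Optional


def vaf_at_family_size(
    families: list[tuple[str, int]], min_family_size: int
) -> Optional[float]:
    """VAF among consensus reads whose family size meets the threshold.

    Each element of `families` is (allele, family_size), where allele is
    "variant" or "wt". Returns None when no consensus reads remain.
    """
    kept = [allele for allele, size in families if size >= min_family_size]
    if not kept:
        return None
    return sum(1 for allele in kept if allele == "variant") / len(kept)


def abs_relative_percent_change(vaf_reference: float, vaf_filtered: float) -> float:
    """Absolute relative percent change in VAF versus the reference VAF."""
    if vaf_reference == 0:
        raise ValueError("reference VAF must be non-zero")
    return abs(100.0 * (vaf_filtered - vaf_reference) / vaf_reference)
```

A return value of 0.0 from `vaf_at_family_size` corresponds to loss of the variant allele at that threshold, while `None` indicates that no consensus reads of either allele remain.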

FIGS. 21A-E show the effects of family size on VAF. Overall, VAF was relatively stable up to a family size≥10 in unselected ccfDNA (FIG. 21A). However, at larger family sizes VAF became less stable and included complete loss of variants in some samples (FIG. 21B, magnification of area in blue box shown in FIG. 21A). Of note, complete loss of the variant allele occurred in one sample with an initial VAF>5% (FIG. 21A and FIG. 21B, black arrow). In contrast, VAF remained relatively stable up to family size≥20 in the short ccfDNA fraction (FIG. 21C, magnification of area in the box shown in FIG. 21D). Note the apparent increase of VAF in the short ccfDNA fraction at lower allele frequencies (FIG. 21D) compared to the unselected ccfDNA (FIG. 21B). The relative percent difference in VAF was similar in unselected and short ccfDNA at family size (FS)≥5 and FS≥10 (FIG. 21E). However, the relative percent difference was statistically significantly lower in the short ccfDNA fraction at FS≥15 and FS≥20 (FIG. 21E). * P<0.05; *** P≤0.001; NS=not significant.

FIGS. 22A-C show the effects of family size on VAF in the medium ccfDNA fraction. Overall, VAF was relatively stable up to a family size≥15 in the medium fraction of ccfDNA (FIG. 22A). However, at larger family sizes VAF became less stable and included complete loss of variants in some samples (FIG. 22B, magnification of area in the box shown in FIG. 22A). The relative percent difference in VAF was similar in unselected and medium ccfDNA at family size (FS)≥5 and FS≥10, but was significantly larger in unselected ccfDNA at FS≥15 (FIG. 22C). At FS≥20, there was a trend for a larger difference of VAF in the unselected ccfDNA, but it was not statistically significant. * P<0.05; NS=not significant.

FIGS. 23A-C show the effects of family size on VAF in the long ccfDNA fraction. Overall, VAF was relatively stable in the long ccfDNA fraction (FIG. 23A) even at large family sizes and lowest VAFs (FIG. 23B, magnification of area in the box shown in FIG. 23A). Of note, in one sample the variant allele was lost at FS≥6. The relative percent difference in VAF was similar in unselected and long ccfDNA at family size (FS)≥5, FS≥10, and FS≥15, but was significantly larger in unselected ccfDNA at FS≥20 (FIG. 23C). *** P≤0.001; NS=not significant.

Methods

Patient Samples and DNA Isolation

Healthy adult volunteers and cancer patients with a BRAF or KRAS solid tumor variant associated with a primary melanoma, pancreatic ductal adenocarcinoma, or colorectal adenocarcinoma were recruited for enrollment. Blood samples were collected in BCT tubes (Streck, La Vista, Nebr.) and processed for buffy coat and plasma extraction within 24 hours. The buffy coat and plasma were separated from whole blood by centrifugation at 1,900 g×10 minutes at 4° C. and aspirated to new tubes. Plasma was then centrifuged at 16,000 g×10 minutes at 4° C. to remove any cellular debris. The plasma supernatant and the buffy coat were stored at −80° C. until further use. Buffy coat DNA (i.e., white blood cell DNA) was isolated from the buffy coat using the QIAamp DNA Mini Kit (Qiagen, Germantown, Md.) and eluted in a final volume of 100 μL 10 mM Tris-Cl and 0.5 mM EDTA (pH 9.0). 100 ng of buffy coat DNA was then sheared using a focused-ultrasonicator (S220, Covaris, Woburn, Mass.) with a targeted size of 175 bp. ccfDNA was isolated from 8 mL of plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen) and eluted in a final volume of 50 μL 10 mM Tris (pH 8.0) and 0.1 mM EDTA. ccfDNA was not sheared.

NGS Library Preparation, Sequencing, and Bioinformatics

Libraries for buffy coat DNA (100 ng) and ccfDNA (10 ng or the quantity equivalent to ccfDNA from 1 mL of plasma, whichever was greater) were prepared using the Kapa Biosystems Hyper Prep Kit for end repair, A-tailing, and ligation of truncated custom IDT adapters that contained an eight base-pair random barcode (i.e., unique molecular identifier) in the index 2 position of a standard Illumina adapter. The Kapa HiFi 2× master mix with truncated-length adapter primer was used for initial library amplification followed by the use of full-length indexing primers during subsequent PCR amplification steps.

Full-length buffy coat DNA and ccfDNA libraries were enriched forregions of interest using a custom designed IDT Xgen capture probe set(Integrated DNA Technologies) containing full exonic or hotspot coverageof 128 genes (128 kb). Paired-end sequencing (2×125 bp) of libraries wasperformed on an Illumina HiSeq 2500. Reads in FASTQ files were alignedto the GRCh37 reference genome and those with the same unclippedalignment start position were grouped into families based on >0.875molecular barcode similarity. Read sequence was extracted from eachfamily and consensus called on each base position. Those with >0.66concordance were assigned the predominant base, otherwise, an N. See theconsensus aligned workflow in FIG. 24. Fragment length was derived frompaired-end alignment information according to SAM format. Identificationof wild type vs. variant allele was determined by a 100% match to an 11bp string within aligned consensus sequences at the locationcorresponding to each known variant. Additionally, aligned base errorrates and occurrence of localized false positive variants werecalculated using our open source EstimateErrorRates and MpileupParserapplications. The USeq EstimateErrorRates application calculates baselevel error rates observed in quality alignments (≥MQ20) from normalgermline sequencing datasets. It parses a Samtools mpileup alignmentstack for regions of 7 adjacent bases with adequate read depth (≥100 Q20bases), no observed indels, and no indication of heterozygous orhomozygous SNVs (allele frequencies≤0.1). Good quality (≥Q20),non-reference, center base observations in each passing region aretabulated. These are used to calculate error rates for each base as wellas the total error observed from quality alignments and quality bases.The USeq MpileupParser works in a similar fashion by parsing a Samtoolsmpileup alignment stack covering bases in a bed file of the 128 kbcapture panel coverage with 25 base pair padding. 
Only quality alignments (≥MQ20) and quality bases (≥Q20) are counted. Locations with evidence of a heterozygous or homozygous allele (AF>0.1) are ignored. The application outputs a bed file of each passing base with its observed non-reference allele frequencies. At FS≥1, allele frequencies were binned (<0.1%, 0.1% to 1.0%, and 1% to 2.0%) and then tracked for presence/absence at subsequent family sizes.
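The family grouping and consensus rules described above can be illustrated with a minimal Python sketch. It is not the USeq implementation: the function names are hypothetical, barcode similarity is assumed to be the fraction of matching positions between two equal-length barcodes, and all reads in a family are assumed to have equal length. Only the thresholds (>0.875 similarity, >0.66 concordance) come from the text.

```python
from collections import Counter


def barcode_similarity(umi_a: str, umi_b: str) -> float:
    """Fraction of positions at which two equal-length barcodes agree (assumed metric)."""
    if len(umi_a) != len(umi_b):
        raise ValueError("barcodes must have the same length")
    matches = sum(a == b for a, b in zip(umi_a, umi_b))
    return matches / len(umi_a)


def same_family(start_a, umi_a, start_b, umi_b, min_similarity=0.875):
    """True when unclipped start positions match and similarity exceeds the threshold."""
    return start_a == start_b and barcode_similarity(umi_a, umi_b) > min_similarity


def call_consensus(reads, min_concordance=0.66):
    """Collapse equal-length reads from one family into a consensus string."""
    consensus = []
    for column in zip(*reads):  # one tuple of observed bases per position
        base, count = Counter(column).most_common(1)[0]
        # Keep the predominant base only if its share exceeds the threshold.
        consensus.append(base if count / len(column) > min_concordance else "N")
    return "".join(consensus)


if __name__ == "__main__":
    family = ["ACGTAC", "ACGTAC", "ACCTAC"]  # hypothetical reads
    print(call_consensus(family))
```

Under this assumed similarity measure, a single mismatch in an eight-base barcode gives a similarity of exactly 0.875, which does not exceed the stated threshold, so such reads would fall into separate families in this sketch.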

FIG. 24 shows an example of a consensus alignment workflow. A Snakemake workflow was constructed to convert FASTQ sequencing datasets with unique molecular identifiers to processed alignments. This involved 18 steps as represented in this directed acyclic graph. In brief, alignments are generated with bwa. Those with the same unclipped start position are grouped by UMI and collapsed to a single error-corrected consensus sequence with USeq tools. These are aligned, merged, and passed through GATK's best practice INDEL realignment and base score recalibration process. Throughout, various quality control files are generated including a unique observation read coverage data track.

Fragment Size Selection

Selection of fractions from truncated ccfDNA libraries was done with an automated liquid handler (NIMBUS Select, Hamilton, Reno, Nev.) that incorporated Ranger Technology (Coastal Genomics, Burnaby, BC) for the monitoring and real-time manipulation of electrophoretic mobilities through a 3.0% agarose matrix in a 12-channel cassette. Prior to use on human samples, extraction parameters were optimized with a four-rung ladder constructed from lambda phage using Hot Start Taq DNA polymerase (Roche, New York, N.Y.) and the following primer pairs to generate specific lengths of lambda DNA:

278 bp: [SEQ ID NO: 01] 5′-GATGCGATGTTATCGGTGCG-3′ and [SEQ ID NO: 02] 5′-CACAGGTGAGCCGTGTAGTT-3′
268 bp: [SEQ ID NO: 03] 5′-TGGAACCCACCGAGTGAAAG-3′ and [SEQ ID NO: 04] 5′-CAATGCAGCAGCAGTCATCC-3′
233 bp: [SEQ ID NO: 05] 5′-CGGCACGATCTCGTCAAAAC-3′ and [SEQ ID NO: 06] 5′-GCCTTGAACTGAAATGCCCG-3′
223 bp: [SEQ ID NO: 07] 5′-GGAAGCTGCATGATGCGATG-3′ and [SEQ ID NO: 08] 5′-CTGGTGCGTTTCGTTGGAAG-3′

Ladder lengths were constructed to guide targeting of desired ccfDNA fragment lengths after the addition of the truncated adapters (~103 bp; FIG. 6A). The short fraction was optimized to extract a ccfDNA fraction that included both the 223 and 233 bands and no portion of the 268 band, while the long fraction was optimized to include both the 268 and 278 bands, but not the 233 band. After optimization, PCR-amplified truncated ccfDNA libraries (1 μg; FIG. 6A) were loaded into the cassette (Coastal Genomics) and short and long fractions were collected from a single run. A second run using intermediate parameters to collect a medium fraction between the short and long fractions from PCR-amplified truncated ccfDNA libraries (1 μg) was also performed. Collected fractions were mixed with QG buffer (Qiagen; 1.4 volumes to 1 volume of sample) and loaded onto a QIAquick spin column from the QIAquick PCR Purification Kit (Qiagen). The remaining manufacturer's instructions for the kit were then followed and ccfDNA library fractions were eluted in 30 μL of EB buffer (Qiagen). From the eluate, 20 μL was used with full-length indexing primers during PCR amplification in preparation for sequencing (FIG. 6A). Densitometry (TapeStation 2200, Agilent Technologies) was used to characterize ccfDNA fragment distribution from unselected and size-selected ccfDNA full-length libraries at a loading concentration of 5 ng/μL.
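The relationship between the ladder rungs and the underlying ccfDNA insert lengths is simple arithmetic, sketched below in Python. The adapter length (approximately 103 bp) and rung lengths come from the text; the function and constant names are illustrative only, and the adapter contribution is treated as exactly 103 bp for the calculation.

```python
ADAPTER_BP = 103                  # approximate length added by the truncated adapters
LADDER_BP = [223, 233, 268, 278]  # lambda ladder rungs used to tune extraction


def insert_length(library_fragment_bp: int, adapter_bp: int = ADAPTER_BP) -> int:
    """ccfDNA insert length implied by a library fragment of the given length."""
    return library_fragment_bp - adapter_bp


if __name__ == "__main__":
    for rung in LADDER_BP:
        print(f"{rung} bp library fragment ~ {insert_length(rung)} bp ccfDNA insert")
```

Under this approximation the 223 and 233 bp rungs correspond to inserts of about 120 and 130 bp, and the 268 and 278 bp rungs to inserts of about 165 and 175 bp, so a short fraction that includes the first two rungs and excludes the 268 bp rung targets inserts below roughly 165 bp.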

Droplet Digital PCR

Droplet digital PCR (ddPCR) assays were performed on the RainDrop Plus™ Digital PCR System (Bio-Rad). For detection of EGFR T790M and KRAS G13D, published assays were used. For additional assays, primer pairs were designed with a target amplicon size <100 bp to accommodate amplification from cell-free DNA samples. Dual-color (FAM/TET) hydrolysis probes containing locked nucleic acid (LNA) nucleotides, 3′ terminal and/or internal quenchers (Iowa Black/ZEN) were designed to distinguish wildtype from mutant alleles. All primers and probes were sourced from Integrated DNA Technologies. Reactions were set up in a final volume of 25 μL using TaqMan Genotyping Master Mix (Life Technologies). Primers were added to a final concentration of 500 nM, probes to final concentrations of 100 nM (BRAF) or 200 nM (all other assays). Up to 10.5 μL of template DNA was tested containing 50 ng of amplified sequencing libraries or varying amounts of cell-free DNA (range: 7-46 ng). False positive noise and limit of blank (LOB) of all assays were determined from a collection of wild-type-only samples and no-template controls (FIG. 25A-C). Data were analyzed using RD Analyst software.

FIGS. 25A-C show false positive droplet events in control samples. For each ddPCR assay, false positive droplet events were measured in a collection of controls (FIG. 25A). Samples tested in each assay included full-length libraries (n≥11), plasma cell-free DNA (n≥9), buffy coat DNA (n≤3) and no template controls (n≤3). A Poisson model was applied to fit the observed false positive distribution (dashed line). The mean of the Poisson distribution (λ) was determined and the limit of blank (LOB) for each assay was calculated from the 95% confidence interval of the Poisson distribution as well as from the 95% limit of the empirical distribution. False positive variant allele frequency (VAF) was determined for each control experiment, excluding no template controls (FIG. 25B). Median VAF, interquartile range and 95th percentile (error bars) of false positive VAFs for each assay are indicated. Data are summarized in table format (FIG. 25C).
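As an illustration of a Poisson-based limit of blank, the following Python sketch estimates λ as the mean false positive droplet count across controls and takes the LOB as the smallest count whose cumulative Poisson probability reaches 95%. This one-sided reading of the 95% limit is an assumption, as are the function names and the example counts; the source does not specify the exact calculation.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for a Poisson random variable with mean lam."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


def limit_of_blank(false_positive_counts, coverage=0.95):
    """Return (lambda, LOB) from per-control false positive droplet counts.

    LOB is taken as the smallest integer k with P(X <= k) >= coverage
    (an assumed interpretation of the 95% Poisson limit).
    """
    if not false_positive_counts:
        raise ValueError("at least one control measurement is required")
    lam = sum(false_positive_counts) / len(false_positive_counts)
    k = 0
    while poisson_cdf(k, lam) < coverage:
        k += 1
    return lam, k


if __name__ == "__main__":
    controls = [0, 1, 0, 2, 1, 0, 0, 1]  # hypothetical false positive droplet counts
    lam, lob = limit_of_blank(controls)
    print(f"lambda = {lam:.3f}, LOB = {lob} droplets")
```

A sample would then be called positive in this scheme only when its mutant droplet count exceeds the LOB for that assay.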

Size Selection of Synthetically Spiked ccfDNA

Synthetic DNA gBlocks® including 130 bp of genomic EGFR sequence spanning the c.2369C>T (p.T790M) point mutation and 165 bp of genomic BRAF sequence spanning the c.1799T>A (p.V600E) mutation were purchased from Integrated DNA Technologies. gBlocks® were reconstituted in TE buffer, serially diluted and quantified by ddPCR to determine absolute copy number. Sufficient 130 bp EGFR T790M and 165 bp BRAF V600E gBlocks® were spiked into a sample of pooled cell-free DNA collected from healthy donors to yield a target VAF of ~10% for both alleles. 10 ng of spiked cell-free DNA and 50 ng of the corresponding unspiked pooled ccfDNA were used for NGS truncated library preparation as described above. The presence of synthetic mutations in the spiked library was verified by ddPCR (EGFR T790M 11.4% VAF; BRAF V600E 12.1%). Truncated-length libraries of spiked and unspiked samples were subsequently mixed to generate an eight-step serial dilution series. Two independent dilution series were produced. 1 μg of each dilution and unspiked control libraries were size-selected for isolation of short and long fractions as described above. Full-length libraries were produced from extracted fractions and unselected samples and analyzed for EGFR T790M and BRAF V600E VAF by ddPCR.
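The spike-in arithmetic can be sketched as follows. This is an illustrative Python calculation, not the procedure actually used: it assumes VAF is defined as mutant copies divided by total copies at the locus, that the unspiked library carries no mutant copies, and that spiked and unspiked libraries have equal copy concentration when mixed. The function names, wild-type copy number, and mixing fractions are hypothetical.

```python
def spike_copies_for_vaf(wildtype_copies: float, target_vaf: float) -> float:
    """Mutant copies to add so that mutant / (mutant + wildtype) equals target_vaf."""
    if not 0.0 <= target_vaf < 1.0:
        raise ValueError("target_vaf must be in [0, 1)")
    return wildtype_copies * target_vaf / (1.0 - target_vaf)


def mixed_vaf(spiked_vaf: float, spiked_fraction: float) -> float:
    """VAF after mixing a spiked library with a mutation-free unspiked library.

    Assumes both libraries contribute the same number of locus copies per unit
    of mass, so VAF scales linearly with the spiked fraction.
    """
    if not 0.0 <= spiked_fraction <= 1.0:
        raise ValueError("spiked_fraction must be in [0, 1]")
    return spiked_vaf * spiked_fraction


if __name__ == "__main__":
    # Hypothetical: 900 wild-type copies at the locus, target VAF of 10%.
    print(spike_copies_for_vaf(900, 0.10))
    # Measured EGFR T790M VAF of the spiked library (11.4%) mixed 1:1 with unspiked.
    print(mixed_vaf(0.114, 0.5))
```

The dilution factor between steps of the eight-step series is not stated in the text, so the 1:1 mix above is only an example of a single step.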

Statistics

For paired samples, the paired t-test was applied. The independent t-test was used for comparison of two independent samples and Levene's test determined whether variances were equal or unequal. For multiple samples, one-way analysis of variance (ANOVA) was applied followed by a Tukey post-hoc test for comparisons between pairs of samples. Pearson's correlation coefficient (r) evaluated associations between samples. Boxplots show the median value and the 25th and 75th percentiles. Whiskers on boxplots identify the 5th and the 95th percentiles. For comparison of VAF between different family sizes, the absolute value of relative percent change was calculated to weight all changes in VAF similarly. For comparison of WT and variant counts between ddPCR and NGS, the percent change relative to ddPCR was calculated to normalize the data to account for differences in counts between samples. All statistical analysis was performed in SPSS (Version 24, IBM). Statistical significance was defined as P<0.05.
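The two normalizations described above can be written out as a short Python sketch. The function names and example values are hypothetical, and which family size serves as the reference VAF is an assumption, since it is not specified here.

```python
def abs_relative_percent_change(reference_vaf: float, new_vaf: float) -> float:
    """Absolute value of the percent change in VAF relative to a reference VAF."""
    if reference_vaf == 0:
        raise ValueError("reference_vaf must be non-zero")
    return abs((new_vaf - reference_vaf) / reference_vaf) * 100.0


def percent_change_relative_to_ddpcr(ddpcr_count: float, ngs_count: float) -> float:
    """Signed percent change of an NGS count relative to the ddPCR count."""
    if ddpcr_count == 0:
        raise ValueError("ddpcr_count must be non-zero")
    return (ngs_count - ddpcr_count) / ddpcr_count * 100.0


if __name__ == "__main__":
    # Hypothetical VAFs (%) at a reference family size and at a larger family size.
    print(abs_relative_percent_change(2.0, 1.5))
    # Hypothetical wild-type counts from ddPCR and NGS for the same sample.
    print(percent_change_relative_to_ddpcr(200, 150))
```

Taking the absolute value treats a rise and a fall of the same relative magnitude equally, while dividing by the ddPCR count puts samples with very different total counts on a common scale.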

1. A method of increasing detection of low-abundant fragments of cell-free DNA (ccfDNA) in a biological sample from a subject, comprising: isolating an initial fraction of ccfDNA fragments from a biological sample; ligating a unique molecular identifier (UMI) to each of the ccfDNA fragments in the initial fraction; amplifying the plurality of ccfDNA fragments to generate a ccfDNA library; isolating a short fraction of ccfDNA fragments from the ccfDNA library, where the ccfDNA fragments in the short fraction are limited to a size of less than or equal to 160 base pairs (bp); amplifying the ccfDNA fragments in the short fraction; and sequencing the ccfDNA fragments in the short fraction to generate sequenced ccfDNA fragments.